Abstract

In this work, we analyze the learning ability of diffusion-based distributed learners that receive a continuous 
stream of data arising from the same distribution. We establish four distinctive advantages for these learners relative 
to other decentralized schemes. First, we obtain closed-form expressions for the evolution of their excess-risk for 
strongly-convex risk functions under a diminishing step-size rule. Using the result, we then show that the distributed 
strategy can improve the asymptotic convergence rate of the excess-risk by a factor of N relative to non-cooperative 
schemes, where N is the number of learners in the ad-hoc network. We further show that the fastest attainable rate 
of convergence matches the Cramer-Rao bound (up to constants that do not depend on N or the iteration number
i) under some mild regularity conditions on the distribution of the data. Finally, we show that the diffusion strategy 
outperforms consensus-based strategies by reducing the overshoot during the transient phase of the learning process 
and asymptotically as well. In light of these properties, diffusion strategies are shown to enhance the learning ability
of ad-hoc distributed networks by relying solely on localized interactions and on in-network processing. 

Index Terms 

distributed stochastic optimization, maximum-likelihood learning, diffusion strategies, consensus strategies, Cramer-Rao
bound

I. Introduction 

Data mining is a formidable example of an application that involves searching through data generated over 
time by a single agent or by a multitude of agents. For example, search engines utilize user queries to target 
advertising based on the user's search keywords. Online retailers suggest items based on the user's shopping history 
and even on the correlation between the user's shopping history and that of other shoppers. Likewise, online 
video services suggest video content based on the user's viewing history and on the viewing history of similar 
customers. It is clear that data mining applications benefit from leveraging information from different users. It can 
be advantageous to collect the information from all users at a central location for processing and analysis. Many 
current implementations rely on this centralized approach. However, the rapid increase in the number of users,
coupled with privacy and communication constraints related to transmitting, storing, and analyzing data
at remote central locations, has been serving as strong motivation for the development of decentralized solutions
to learning and data mining [1]-[9].

In this work, we concentrate on studying the distributed real-time prediction problem over a network of N 
learners. We assume the network is connected, meaning that any two arbitrary agents are either connected directly 
or by means of a path passing through other agents. Such networks are common and serve as useful models for 
peer-to-peer networks and social networks. The objective of the learning process is for all nodes to minimize 
some strongly-convex objective function, termed the risk function, in a distributed manner. We shall compare the
performance of the distributed strategies to that of a centralized algorithm that can optimize the risk and determine 
or approach the optimal solution. We designate the gap between the risk achieved by the distributed solution and 
the risk achieved by the optimal solution as the excess-risk. 

We will propose distributed strategies that will be shown to converge in the mean-square-error sense to the 
optimal solution when a decaying step-size sequence is used. We will also show that the distributed algorithm will 
asymptotically converge with probability one to the optimizer of the risk function when a decaying but
square-summable step-size sequence is used. We further show that the distributed solution can achieve a $\Theta(1/(Ni))$ convergence
rate asymptotically, where $i$ is the iteration number. We derive a closed-form expression for the asymptotic
excess-risk performance of the distributed algorithms. Using the derived expressions, we examine the convergence rate of
the distributed solution and show that the diffusion strategy is asymptotically efficient (up to a constant) in the
Cramer-Rao sense under some mild regularity conditions on the probability distribution of the noise. This useful
conclusion means that no algorithm, centralized, non-recursive, or otherwise, that has access to all the data
from all the nodes at every time step can achieve a better convergence rate as $i \to \infty$.

We may mention that previous works in the machine learning literature have generally required some special
structure in the network such as in [10], where the network obeys a master/worker architecture, and in [11], where
communication must take place over a bounded-degree acyclic graph. In the algorithms proposed in [10], [11],
the risk function is not assumed to be strongly-convex. For this reason, the algorithms only achieve excess-risk
improvement on the order of $O(1/\sqrt{Ni})$. This effect is related to the fact that the lack of strong-convexity causes
a non-cooperative solution to at best have $O(1/\sqrt{i})$ performance for $i$ iterations. It is unclear if the algorithms
proposed in [10], [11] can converge at the rate $O(1/(Ni))$ if the risk function is strongly-convex.

Furthermore, while diffusion and consensus-based algorithms have the same computational complexity, we will 
establish that diffusion algorithms actually reduce the overshoot caused by the use of large initial step-sizes and 
perform better than consensus-based algorithms asymptotically as well. This means that while consensus-based
algorithms can achieve a convergence rate of $\Theta(1/(Ni))$ asymptotically [12], diffusion-based methods also achieve
this rate, reduce the transient term faster in order to be dominated by the asymptotic term sooner, and lead to
lower excess-risk than the consensus algorithm. This result implies that the useful algorithms in [12] will have
worse performance than diffusion strategies even though both algorithms may still achieve a $\Theta(1/(Ni))$ convergence
rate asymptotically.

A. Outline and Summary of Results 

For the benefit of the reader, we provide here a summary of the main contributions of this work: 

1) We propose a distributed algorithm for learning over networks that relies on the use of diffusion strategies
(see (9)-(10)). We establish that this algorithm converges almost surely to the desired optimizer (see Theorem 1).
2) We derive an expression for the evolution of the excess-risk of the algorithm over time (see (35)). We determine
useful closed-form asymptotic expressions for the excess-risk under different conditions (see Theorems 2-3).
3) We quantify the asymptotic convergence rates of each term in the excess-risk expression and show that these
rates can be one of three possibilities depending on a certain threshold (see Table I).
4) We then show how to speed up the convergence rate through the optimal selection of the algorithm's
combination weights (see Sec. IV).
5) We show that the diffusion strategy leads to an $N$-fold improvement in convergence rate relative to the
non-cooperative strategy where nodes perform learning individually and without cooperation (see Sec. IV).
We also show that this $N$-fold improvement matches the improvement that a centralized stochastic gradient
learner provides (see Corollaries 1-2). We further show that no other learning algorithm can provide a better
convergence rate beyond a constant that depends on neither $N$ nor $i$.
6) We compare two classes of distributed learning strategies: diffusion and consensus. We show that the latter has
worse transient performance and that its asymptotic excess-risk curve is lower-bounded by that of diffusion
(see Sec. V).

II. Problem Formulation and Algorithm 

Consider a network of $N$ learners. Each learner $k$ receives a sequence of independent data samples $\boldsymbol{x}_{k,i}$, for
$i = 1, 2, \ldots$, arising from some fixed distribution $\mathcal{X}$. The goal of each agent is to learn a vector $w^o$ that optimizes
some loss function $Q(w, \boldsymbol{x}_{k,i})$ on average. For example, in order to learn the hyper-plane that best separates
feature data $\boldsymbol{h}_{k,i}$ belonging to one of two classes $\boldsymbol{y}_{k,i} \in \{+1, -1\}$, a support-vector-machine (SVM) [13], [14]
would minimize the expected value of the following loss function over $w$ (with the expectation computed over the
distribution of the data $\boldsymbol{x}_{k,i} = \{\boldsymbol{h}_{k,i}, \boldsymbol{y}_{k,i}\} \sim \mathcal{X}$) [15]:

$$Q^{\text{SVM}}(w, \boldsymbol{h}_{k,i}, \boldsymbol{y}_{k,i}) \triangleq \frac{\rho}{2}\|w\|^2 + \max\{0,\; 1 - \boldsymbol{y}_{k,i}\boldsymbol{h}_{k,i}^T w\} \quad \text{[SVM]} \tag{1}$$

where $\rho > 0$ is a regularization constant. While the SVM loss function is not differentiable, other differentiable
classifiers can be used such as the regularized logistic regression [16]:

$$Q^{\text{LR}}(w, \boldsymbol{h}_{k,i}, \boldsymbol{y}_{k,i}) \triangleq \frac{\rho}{2}\|w\|^2 + \log\left(1 + e^{-\boldsymbol{y}_{k,i}\boldsymbol{h}_{k,i}^T w}\right) \quad \text{[Regularized logistic regression]} \tag{2}$$

or the quadratic loss [17, pp. 163-166] (also referred to as the "delta rule" [18]):

$$Q^{\text{quad}}(w, \boldsymbol{h}_{k,i}, \boldsymbol{y}_{k,i}) \triangleq \left(\boldsymbol{y}_{k,i} - \boldsymbol{h}_{k,i}^T w\right)^2 \quad \text{[Quadratic loss]} \tag{3}$$
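As an illustration of the three loss functions (1)-(3), the following is a minimal sketch in plain Python. The helper names (`dot`, `svm_loss`, `logistic_loss`, `quadratic_loss`, `empirical_risk`) are our own and are not part of the original work; the sample average in `empirical_risk` is only a finite-sample stand-in for the expectation that defines the risk.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def svm_loss(w, h, y, rho):
    """Regularized hinge loss, as in (1)."""
    return 0.5 * rho * dot(w, w) + max(0.0, 1.0 - y * dot(h, w))

def logistic_loss(w, h, y, rho):
    """Regularized logistic loss, as in (2)."""
    return 0.5 * rho * dot(w, w) + math.log(1.0 + math.exp(-y * dot(h, w)))

def quadratic_loss(w, h, y):
    """Quadratic loss, as in (3)."""
    return (y - dot(h, w)) ** 2

def empirical_risk(loss, w, samples, **params):
    """Sample average of a loss over (h, y) pairs (hypothetical helper)."""
    return sum(loss(w, h, y, **params) for h, y in samples) / len(samples)
```

For instance, `empirical_risk(svm_loss, w, samples, rho=0.1)` averages the hinge loss over a list of `(h, y)` pairs.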
More generally, the expectation of the loss function over the distribution $\mathcal{X}$ of the data is referred to as the risk
function [19, p. 18]:

$$J(w) \triangleq \mathbb{E}_{\mathcal{X}}\{Q(w, \boldsymbol{x}_{k,i})\} \quad \text{[risk function]} \tag{4}$$

The risk function is the target function to be minimized over the parameter vector $w$. We refer to the optimizer of
(4) as $w^o$:

$$w^o \triangleq \arg\min_{w} J(w) \tag{5}$$

where $w^o$ is unique when $J(w)$ is strongly-convex. In order to measure the performance of each learner, we define
the excess-risk (ER) at node $k$ as:

$$\mathrm{ER}_k(i) \triangleq \mathbb{E}\{J(\boldsymbol{w}_{k,i-1}) - J(w^o)\} \tag{6}$$

where $\boldsymbol{w}_{k,i}$ denotes the estimator of $w^o$ that is computed by node $k$ at time $i$ (i.e., it is the estimator that is
generated by observing current and past data within the neighborhood of node $k$). The excess-risk serves as a measure
of how well the estimate $\boldsymbol{w}_{k,i-1}$ will perform on a new sample $\boldsymbol{x}_{k,i} \sim \mathcal{X}$ on average. For this reason, the
excess-risk is also referred to as the generalization ability of the classifier. The excess-risk is non-negative because
$J(w)$ is strongly-convex and, therefore, $J(w') > J(w^o)$ for all $w' \neq w^o$.

One way to optimize (5) is for each node $k$ to implement a stochastic gradient algorithm of the following form:

$$\boldsymbol{w}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu(i)\nabla_w Q(\boldsymbol{w}_{k,i-1}, \boldsymbol{x}_{k,i}) \quad \text{[no cooperation]} \tag{7}$$

where $\nabla_w Q(\cdot)$ denotes the gradient vector of the loss function, and $\mu(i) > 0$ is a step-size sequence. The gradient
vector employed in (7) is an approximation for the actual gradient vector, $\nabla_w J(\cdot)$, of the risk function. The difference
between the true gradient vector and its approximation used in (7) is called gradient noise. Due to the presence of
the gradient noise, the estimate $\boldsymbol{w}_{k,i}$ becomes a random quantity; we shall use boldface letters to refer to random
variables throughout our manuscript, which is already reflected in our notation in (7).
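The non-cooperative recursion (7) can be sketched for the quadratic loss (3) as follows. This is a minimal illustration under our own assumptions: the step-size is taken as $\mu(i) = \mu/i$, the data generator `make_samples` (Gaussian regressors, linear model plus noise) is hypothetical, and the function names are not from the original work.

```python
import random

def sgd_quadratic(samples, mu, w0):
    """Stand-alone recursion (7) for the quadratic loss, with mu(i) = mu / i."""
    w = list(w0)
    for i, (h, y) in enumerate(samples, start=1):
        err = y - sum(a * b for a, b in zip(h, w))
        # Gradient of (y - h^T w)^2 with respect to w is -2 * err * h.
        grad = [-2.0 * err * hj for hj in h]
        step = mu / i
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w

def make_samples(w_true, n, noise_std, rng):
    """Hypothetical data model: y = h^T w_true + noise, with h ~ N(0, I)."""
    samples = []
    for _ in range(n):
        h = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = sum(a * b for a, b in zip(h, w_true)) + rng.gauss(0.0, noise_std)
        samples.append((h, y))
    return samples
```

Under this data model the Hessian of the risk is $2I$, so choosing `mu=0.5` satisfies $2\lambda_{\min}\mu > 1$, the regime in which the fast $1/i$ rate discussed later applies.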



It is shown in [21], [22] that for strongly-convex risk functions $J(w)$, the non-cooperative scheme (7)
achieves an asymptotic convergence rate of the order of $O(1/i)$ under some conditions on the gradient noise and
the step-size sequence $\mu(i)$. In this way, in order to achieve an excess-risk accuracy of the order of $O(\epsilon)$, the
non-cooperative algorithm would require $\Theta(1/\epsilon)$ samples. It is further shown in [21], [23] that no algorithm
can improve upon this rate under the same conditions. This implies that if no cooperation is to take place between
the nodes, then the best asymptotic rate each learner would hope to achieve is on the order of $\Theta(1/i)$, where the
notation $\Theta(g(i))$ means that there exist positive constants $c_1$, $c_2$ such that $c_1 g(i) \le \Theta(g(i)) \le c_2 g(i)$ for sufficiently
large $i$.
On the other hand, assume the nodes transmit their samples to a central processor, which executes the following
centralized algorithm:

$$\boldsymbol{w}_i = \boldsymbol{w}_{i-1} - \frac{\mu(i)}{N}\sum_{k=1}^{N}\nabla_w Q(\boldsymbol{w}_{i-1}, \boldsymbol{x}_{k,i}) \quad \text{[centralized]} \tag{8}$$

It can be shown that this implementation will have an asymptotic convergence rate of the order of $O(1/(Ni))$ for
step-size sequences of the form $\mu(i) = \mu/i$ and for some conditions on $\mu$ (see Corollary 2). In other words, the
centralized implementation (8) provides an $N$-fold increase in convergence rate relative to the non-cooperative
solution (7). One of the questions we wish to answer in this work is whether it is possible to derive a fully
distributed algorithm that allows every node in the network to converge at the same rate as the centralized solution,
i.e., $O(1/(Ni))$, with only communication between neighboring nodes and for general ad-hoc networks. We show
that this task is indeed feasible. We additionally show that each node in the network will converge at this rate with
high probability and that the algorithm will achieve the Cramer-Rao convergence rate up to some constants that do
not depend on the number of nodes $N$ or the iteration number $i$.

A. Diffusion Strategy 

Following the approach of [9], it is possible to derive the diffusion strategy for the distributed evaluation of
estimates $\boldsymbol{w}_{k,i}$ by the various nodes in the network listed in (9)-(10); these estimates serve as approximations for
the minimizer of (4):

$$\boldsymbol{\psi}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu(i)\widehat{\nabla_w J}(\boldsymbol{w}_{k,i-1}) \quad \text{[adaptation]} \tag{9}$$

$$\boldsymbol{w}_{k,i} = \sum_{\ell=1}^{N} a_{\ell k}\boldsymbol{\psi}_{\ell,i} \quad \text{[aggregation]} \tag{10}$$

where $\widehat{\nabla_w J}(\cdot)$ is an instantaneous approximation for the true gradient vector $\nabla_w J(\cdot)$. Each node $k$ begins with
an estimate $\boldsymbol{w}_{k,0}$ and employs a diminishing positive step-size sequence $\mu(i)$. The non-negative coefficients $\{a_{\ell k}\}$,
which form the left-stochastic $N \times N$ combination matrix $A$, are used to scale information arriving at node $k$ from
its neighbors. Therefore, the coefficients satisfy:

$$\sum_{\ell=1}^{N} a_{\ell k} = 1, \quad a_{kk} > 0, \quad a_{\ell k} = 0 \text{ if nodes } \ell \text{ and } k \text{ are not connected}$$

The neighborhood $\mathcal{N}_k$ for node $k$ is defined as the set of nodes $\ell$ for which $a_{\ell k} \neq 0$. The main difference between
the above algorithm and the adapt-then-combine (ATC) diffusion strategy proposed in [9] is that we are employing
a diminishing step-size sequence $\mu(i)$ as opposed to a constant step-size. Constant step-sizes have an advantage in
that they allow nodes to continue adapting their estimates in response to drifts in the underlying data distribution.

On the other hand, a consensus-based algorithm [24] for the minimization of (4) performs the adaptation
and aggregation steps simultaneously as follows:

$$\boldsymbol{w}_{k,i} = \sum_{\ell=1}^{N} a_{\ell k}\boldsymbol{w}_{\ell,i-1} - \mu(i)\widehat{\nabla_w J}(\boldsymbol{w}_{k,i-1}) \quad \text{[consensus]} \tag{11}$$
The diffusion and consensus strategies (9)-(11) have exactly the same computational complexity, except that the
computations are performed in a different order. We will see in Sec. V that this difference enhances the performance
of diffusion over consensus. Moreover, in the constant step-size case, the difference in the order in which the
operations are performed leads to an anomaly in the behavior of consensus solutions in that they can become
unstable even if all individual nodes are able to solve the inference task in a stable manner; see [25].
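The difference in the order of operations between diffusion (9)-(10) and consensus (11) can be made concrete with a minimal sketch. The function names and the calling convention (`grad(k, w)` returns the approximate gradient used by node `k` at the point `w`, and `A[l][k]` stores $a_{\ell k}$) are our own assumptions for illustration.

```python
def combine(A, vectors, k):
    """Return sum over l of a_{lk} * vectors[l], where A[l][k] = a_{lk}."""
    dim = len(vectors[0])
    return [sum(A[l][k] * vectors[l][d] for l in range(len(vectors)))
            for d in range(dim)]

def diffusion_step(W, grad, A, mu_i):
    """One iteration of (9)-(10): adapt first, then aggregate the adapted values."""
    n = len(W)
    psi = [[wj - mu_i * gj for wj, gj in zip(W[k], grad(k, W[k]))]
           for k in range(n)]
    return [combine(A, psi, k) for k in range(n)]

def consensus_step(W, grad, A, mu_i):
    """One iteration of (11): aggregate previous iterates and subtract the
    gradient evaluated at the node's own previous iterate."""
    n = len(W)
    return [[cj - mu_i * gj
             for cj, gj in zip(combine(A, W, k), grad(k, W[k]))]
            for k in range(n)]
```

With a noise-free gradient `grad = lambda k, w: [w[0] - 1.0]` (a hypothetical risk minimized at 1), two nodes starting at 0 and 2, uniform weights, and `mu_i = 0.5`, the diffusion step places both nodes at 1, whereas the consensus step leaves them at 1.5 and 0.5.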

In order to proceed with the analysis of the distributed solutions, it is necessary to introduce an assumption on 
the risk function J(w) — specifically, we assume that J{w) is strongly-convex. 

Assumption 1 (Bounded Hessian matrix): The risk function $J(w)$ is twice continuously-differentiable and the
Hessian matrix of $J(w)$ is uniformly bounded from below and from above, namely,

$$\lambda_{\min} I_M \le \nabla_w^2 J(w) \le \lambda_{\max} I_M \tag{12}$$

where $0 < \lambda_{\min} \le \lambda_{\max} < \infty$. ■

Assumption 1 is equivalent to assuming that $J(w)$ is strongly-convex with a Lipschitz continuous gradient function,
as is commonly assumed in the literature pO) , pTj , p6) . The upper bound in ([12]) is equivalent to 

$$\|\nabla_w J(x) - \nabla_w J(y)\| \le \lambda_{\max} \cdot \|x - y\| \tag{13}$$

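We also need to introduce an assumption on the approximation used for the gradient vector. As indicated before,
the instantaneous gradient at node $k$ and time $i$ is defined in terms of the gradient of the loss function, i.e.,

$$\widehat{\nabla_w J}(\boldsymbol{w}_{k,i-1}) \triangleq \nabla_w Q(\boldsymbol{w}_{k,i-1}, \boldsymbol{x}_{k,i}) \tag{14}$$

Assumption 2 (Gradient noise model): We model the approximate gradient vector as:

$$\widehat{\nabla_w J}(\boldsymbol{w}) = \nabla_w J(\boldsymbol{w}) + \boldsymbol{v}_{k,i}(\boldsymbol{w}) \tag{15}$$

where, conditioned on the past history of the estimators, $\mathcal{F}_{i-1} = \{\boldsymbol{w}_{k,j} : k = 1, \ldots, N \text{ and } j \le i-1\}$, the gradient
noise $\boldsymbol{v}_{k,i}(\boldsymbol{w})$ satisfies:

$$\mathbb{E}\{\boldsymbol{v}_{k,i}(\boldsymbol{w}) \mid \mathcal{F}_{i-1}\} = 0$$

$$\mathbb{E}\{\|\boldsymbol{v}_{k,i}(\boldsymbol{w})\|^2 \mid \mathcal{F}_{i-1}\} \le \alpha \cdot \|w^o - \boldsymbol{w}\|^2 + \sigma_v^2 \tag{16}$$

for some $\alpha \ge 0$, $\sigma_v^2 \ge 0$, and where $\boldsymbol{w} \in \mathcal{F}_{i-1}$. ■
Since nodes sample the data in an independent and identically distributed (i.i.d.) fashion from the distribution $\mathcal{X}$,
it is reasonable to expect the gradient noise to be uncorrelated across all nodes, i.e.,

$$\mathbb{E}\{\boldsymbol{v}_{k,i}(w^o)\boldsymbol{v}_{\ell,i}^T(w^o)\} = 0, \quad \forall k \neq \ell, \; \forall i \tag{17}$$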



Finally, we assume that the combination matrix $A$ is left-stochastic and primitive [26, p. 730], [27, p. 516]; every
connected network with at least one nontrivial self-loop (i.e., with $a_{kk} > 0$ for some $k$) automatically ensures a
primitive $A$ [28].
Assumption 3 (The combination matrix A is primitive): The left-stochastic combination matrix $A$ is assumed to
be primitive (i.e., all entries of $A$ are non-negative and all entries of $A^m$ are strictly positive for some integer
$m > 0$). ■
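Now consider the expected excess-risk (6) at node $k$. This risk value allows us to assess the generalization ability of
the classifier $\boldsymbol{w}_{k,i-1}$ on the yet unobserved data $\boldsymbol{x}_{k,i}$. Using the following sequence of inequalities, we can bound
the excess-risk by a useful square weighted norm:

$$\begin{aligned}
\mathrm{ER}_k(i) &\triangleq \mathbb{E}_w\{J(\boldsymbol{w}_{k,i-1}) - J(w^o)\} \\
&\overset{(a)}{=} \mathbb{E}_w\left\{-\left[\int_0^1 \nabla_w J^T(w^o - t\,\tilde{\boldsymbol{w}}_{k,i-1})\,dt\right]\tilde{\boldsymbol{w}}_{k,i-1}\right\} \\
&\overset{(b)}{=} \mathbb{E}_w\left\{-\nabla_w J^T(w^o)\,\tilde{\boldsymbol{w}}_{k,i-1} + \tilde{\boldsymbol{w}}_{k,i-1}^T\left[\int_0^1 t\int_0^1 \nabla_w^2 J(w^o - s\,t\,\tilde{\boldsymbol{w}}_{k,i-1})\,ds\,dt\right]\tilde{\boldsymbol{w}}_{k,i-1}\right\} \\
&\overset{(c)}{=} \mathbb{E}_w\left\{\tilde{\boldsymbol{w}}_{k,i-1}^T\left[\int_0^1 t\int_0^1 \nabla_w^2 J(w^o - s\,t\,\tilde{\boldsymbol{w}}_{k,i-1})\,ds\,dt\right]\tilde{\boldsymbol{w}}_{k,i-1}\right\} \\
&= \mathbb{E}_w\|\tilde{\boldsymbol{w}}_{k,i-1}\|^2_{\boldsymbol{S}_{k,i}}
\end{aligned} \tag{18}$$

$$\mathrm{ER}_k(i) \overset{(d)}{\le} \frac{\lambda_{\max}}{2}\cdot\mathbb{E}_w\|\tilde{\boldsymbol{w}}_{k,i-1}\|^2 \tag{19}$$

where $\tilde{\boldsymbol{w}}_{k,i} \triangleq w^o - \boldsymbol{w}_{k,i}$, $\mathbb{E}_w\{\cdot\}$ denotes expectation over the distribution of $\boldsymbol{w}$, and steps (a) and (b) are a
consequence of the following mean-value theorem from [21, p. 24] for an arbitrary real-valued differentiable function
$f(\cdot)$:

$$f(a + b) = f(a) + \left[\int_0^1 \nabla^T f(a + t\cdot b)\,dt\right]\cdot b \tag{20}$$

Step (c) is a consequence of the fact that $w^o$ optimizes $J(w)$ so that $\nabla_w J(w^o) = 0$. Step (d) is due to (12) in
Assumption 1. Finally, the weighting matrix $\boldsymbol{S}_{k,i}$ that appears in (18) is defined as:

$$\boldsymbol{S}_{k,i} \triangleq \int_0^1 t\int_0^1 \nabla_w^2 J(w^o - s\,t\,\tilde{\boldsymbol{w}}_{k,i-1})\,ds\,dt \tag{21}$$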

Expression (18) shows that the expected excess-risk at node $k$ is equal to a weighted mean-square-error with weight
matrix (21). This means that one way to compute or bound the expected excess-risk is by simply evaluating weighted
mean-square-error quantities of the form (18) or (19). This is the route that we will take in this manuscript, where
we will analyze the right-hand side of (18) in order to draw conclusions regarding the evolution of the expected
excess-risk. In particular, once we establish that the distributed algorithm converges in the mean-square sense,
then inequality (19) would allow us to conclude that the algorithm also converges in expected excess-risk. After
establishing convergence of the expected excess-risk, we can then proceed to show convergence in high probability
of the excess-risk itself.
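For a quadratic risk the identities above can be checked directly, since the Hessian is constant and the weighting matrix (21) reduces to one half of the Hessian. The sketch below assumes, purely for illustration, a risk of the form $J(w) = \frac{1}{2}(w - w^o)^T H (w - w^o)$ with a diagonal $H$; the function names are our own.

```python
def excess_risk_quadratic(hess_diag, w_tilde):
    """J(w) - J(w_o) for J(w) = 0.5 * (w - w_o)^T H (w - w_o), H diagonal."""
    return 0.5 * sum(h * e * e for h, e in zip(hess_diag, w_tilde))

def weighted_norm_sq(weight_diag, w_tilde):
    """Squared weighted norm w_tilde^T S w_tilde for a diagonal S."""
    return sum(s * e * e for s, e in zip(weight_diag, w_tilde))

def upper_bound(hess_diag, w_tilde):
    """Right-hand side of (19) without the expectation: (lambda_max / 2) * ||w_tilde||^2."""
    return 0.5 * max(hess_diag) * sum(e * e for e in w_tilde)
```

Taking `S = [h / 2 for h in hess_diag]` makes `weighted_norm_sq` agree with `excess_risk_quadratic`, and both stay below `upper_bound`, mirroring (18) and (19) for a single realization of the error vector.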

Introduce the weight-error matrix:

$$S \triangleq \frac{1}{2}\nabla_w^2 J(w^o) \tag{22}$$

In the next sections, we will show that the diffusion algorithm allows the estimates $\boldsymbol{w}_{k,i-1}$ to converge almost
surely to $w^o$ as $i \to \infty$ for step-size sequences of the form $\mu(i) = \mu/i$. It will then follow from (21) and (22) that
$\boldsymbol{S}_{k,i} \to S$ almost surely as well. For this reason, we can asymptotically approximate the expected excess-risk by
the expression:

$$\mathrm{ER}_k(i) \approx \mathbb{E}_w\|\tilde{\boldsymbol{w}}_{k,i-1}\|^2_{S} \quad \text{as } i \to \infty \tag{23}$$

III. Main Convergence Results 

In this section, we show that the distributed diffusion strategy (9)-(10) converges both in the mean-square sense
and almost surely. Subsequently, we establish that the algorithm achieves a $\Theta(1/(Ni))$ convergence rate asymptotically.

A. Asymptotic Behavior 

Our first result provides conditions on the step-sizes under which the diffusion algorithm converges in the mean- 
square-error sense and almost surely. The difference between the two sets of conditions that appear below is that 
in one case the step-size sequence is additionally required to be square-summable. 

Theorem 1 (Asymptotic convergence): Let Assumptions 1-3 hold and let the step-size sequence satisfy

$$\mu(i) > 0, \quad \sum_{i=1}^{\infty}\mu(i) = \infty, \quad \lim_{i\to\infty}\mu(i) = 0. \tag{24}$$

Then, $\boldsymbol{w}_{k,i}$ converges in the mean-square-error sense to $w^o$, i.e.,

$$\lim_{i\to\infty}\mathbb{E}\|w^o - \boldsymbol{w}_{k,i}\|^2 = 0 \tag{25}$$

If the step-size sequence satisfies the additional square-summability condition:

$$\sum_{i=1}^{\infty}\mu^2(i) < \infty \tag{26}$$

then $\boldsymbol{w}_{k,i}$ converges to $w^o$ almost surely (i.e., with probability one) for all $k = 1, \ldots, N$.

Proof: See Appendix A-A. ■



Observe that (25) implies that each node converges in the mean-square-error sense. Combining this result with
(19), we conclude that each node also converges in expected excess-risk. Note that this conclusion only depends
on Assumptions 1-3; it does not require the approximation (23).
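The step-size conditions (24) and (26) can be illustrated numerically for the harmonic choice $\mu(i) = \mu/i$: its partial sums keep growing (roughly like $\mu \log n$), while the partial sums of its squares stay bounded by $\mu^2\pi^2/6$. The following sketch, with a helper name of our own choosing, simply accumulates both sums.

```python
import math

def partial_sums(mu, n):
    """Return (sum of mu(i), sum of mu(i)^2) for mu(i) = mu / i, i = 1..n."""
    s1 = 0.0
    s2 = 0.0
    for i in range(1, n + 1):
        step = mu / i
        s1 += step
        s2 += step * step
    return s1, s2
```

Comparing `partial_sums(1.0, 1000)` with `partial_sums(1.0, 100000)` shows the first component still increasing by about $\log(100)$, while the second component barely changes.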



B. Asymptotic Approximation 

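The next question we address is to quantify the benefit of node cooperation. To do so, we assume the step-size
sequence is selected as $\mu(i) = \mu/i$ for some $\mu > 0$. This sequence satisfies conditions (24) and (26). We collect
the weight-error vectors from across the network into the column vector:

$$\tilde{\boldsymbol{w}}_i \triangleq \mathrm{col}\{\tilde{\boldsymbol{w}}_{1,i}, \ldots, \tilde{\boldsymbol{w}}_{N,i}\}, \tag{27}$$

and introduce the block quantities:

$$\mathcal{A} \triangleq A \otimes I_M, \qquad \boldsymbol{g}_i(\boldsymbol{w}) \triangleq \mathrm{col}\{\boldsymbol{v}_{1,i}(\boldsymbol{w}), \ldots, \boldsymbol{v}_{N,i}(\boldsymbol{w})\} \tag{28}$$

where $\otimes$ denotes the Kronecker product operation. We designate the covariance matrix of the gradient noise vector
by:

$$\mathcal{R}_{v,i} \triangleq \mathbb{E}\{\boldsymbol{g}_i(\boldsymbol{w}_{k,i-1})\boldsymbol{g}_i^T(\boldsymbol{w}_{k,i-1})\} \tag{29}$$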

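Now, extending the arguments of [9], we can verify the validity of the following recursion, which relates weighted
variances of two successive network error vectors, $\tilde{\boldsymbol{w}}_{i-1}$ and $\tilde{\boldsymbol{w}}_{i-2}$:

$$\mathbb{E}\|\tilde{\boldsymbol{w}}_{i-1}\|^2_{\Sigma} = \mathbb{E}\|\tilde{\boldsymbol{w}}_{i-2}\|^2_{\boldsymbol{B}_{i-1}^T\Sigma\boldsymbol{B}_{i-1}} + \mu^2(i-1)\,\mathrm{Tr}\left(\Sigma\mathcal{A}^T\mathcal{R}_{v,i-1}\mathcal{A}\right) \tag{30}$$

where

$$\begin{aligned}
\boldsymbol{B}_i &= \mathcal{A}^T\left[I_{MN} - \mu(i)\boldsymbol{H}_{i-1}\right] \\
\boldsymbol{H}_{i-1} &= \mathrm{diag}\left\{\int_0^1 \nabla_w^2 J(w^o - t\,\tilde{\boldsymbol{w}}_{1,i-1})\,dt, \;\ldots,\; \int_0^1 \nabla_w^2 J(w^o - t\,\tilde{\boldsymbol{w}}_{N,i-1})\,dt\right\}
\end{aligned} \tag{31}$$

where the notation $\|w\|^2_{\Sigma}$ denotes the square weighted norm $w^T\Sigma w$. Observe that recursion (30) depends on the
random coefficient matrices $\boldsymbol{B}_i$. Using Theorem 1, however, we see that

$$\boldsymbol{H}_i \to H_\infty, \quad \text{almost surely}$$

where

$$H_\infty = I_N \otimes \nabla_w^2 J(w^o)$$

We conclude that $\boldsymbol{B}_i$ converges almost surely to

$$B_i \triangleq \mathcal{A}^T\left[I_{MN} - \mu(i)H_\infty\right], \tag{32}$$

Furthermore, the covariance matrix $\mathcal{R}_{v,i}$ in (29) will almost surely converge to

$$\begin{aligned}
\mathcal{R}_v &\triangleq I_N \otimes R_v \qquad\qquad (33) \\
&= I_N \otimes \mathbb{E}\{\boldsymbol{v}_{k,i}(w^o)\boldsymbol{v}_{k,i}^T(w^o)\} \qquad (34)
\end{aligned}$$

since we assumed that the gradient noise is uncorrelated across the nodes (17), the nodes sample their data from a
common stationary distribution, and the loss function is time-invariant. Under conditions (32)-(33), we observe that,
asymptotically, the variance relation (30) becomes a deterministic recursion with the randomness in $\boldsymbol{B}_i$ removed.
If we unfold the recursion we arrive at the following expression: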




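$$\mathbb{E}\|\tilde{\boldsymbol{w}}_{i-1}\|^2_{\Sigma} = \underbrace{\mathbb{E}\|\tilde{\boldsymbol{w}}_{0}\|^2_{B_1^T\cdots B_{i-1}^T\,\Sigma\,B_{i-1}\cdots B_1}}_{\text{Transient Term}} + \underbrace{\sum_{j=1}^{i-1}\mu^2(j)\,\mathrm{Tr}\left(B_{j+1}^T\cdots B_{i-1}^T\,\Sigma\,B_{i-1}\cdots B_{j+1}\,\mathcal{A}^T\mathcal{R}_v\mathcal{A}\right)}_{\text{Asymptotic Term}} \tag{35}$$

where the product $B_{i-1}\cdots B_{j+1}$ is taken to be the identity matrix when $j = i-1$.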



Notice that the first term on the right-hand side of (35) models the transient behavior of the diffusion network since
it is governed by the initial error in estimating $w^o$ at the different nodes. In comparison, the second term on the
right-hand side of (35) models the asymptotic behavior of the network. It is necessary to study the behavior of
both terms to understand the dynamics of the network. Different choices for $\Sigma$ correspond to different measures of
performance [29]. In order to evaluate the expected excess-risk at node $k$, we choose $\Sigma$ as

$$\Sigma = \frac{1}{2}E_{kk} \otimes \nabla_w^2 J(w^o) \tag{36}$$

where the quantity $E_{kk}$ denotes the $N \times N$ matrix with a single 1 at the $(k,k)$-th element and all other elements
equal to zero. If, on the other hand, we are interested in recovering the average expected excess-risk across the
network, then we select $\Sigma$ instead as

$$\Sigma = \frac{1}{2N}I_N \otimes \nabla_w^2 J(w^o) \tag{37}$$



In order to facilitate the analysis, we introduce the eigenvalue decomposition of the Hessian matrix of $J(w)$
evaluated at $w^o$, and the Jordan canonical form of the combination matrix $A$:

$$\nabla_w^2 J(w^o) = \Phi\Lambda\Phi^T, \qquad A = TDT^{-1} \tag{38}$$

where $\Phi$ is an orthogonal matrix and $\Lambda$ is diagonal with positive entries. Moreover, since the matrix $A$ is
left-stochastic and primitive, it has a single eigenvalue at one, and the remaining eigenvalues have magnitude strictly
less than one [27, p. 514], [28]. We denote the right eigenvector of $A$ corresponding to the eigenvalue at one by
$p$ and normalize its entries to add up to one, i.e., $\mathbb{1}^T p = 1$. Therefore, we can express the Jordan decomposition
(38) of $A$ as

$$A = p\mathbb{1}^T + YD_{N-1}R^T \tag{39}$$

where $R$ and $Y$ are $N \times (N-1)$ matrices that represent the remaining left and right eigenvectors while $D_{N-1}$ represents the Jordan
structure associated with the eigenvalues other than one. This can be seen by partitioning $T$, $D$, and $T^{-1}$ as

$$T = \begin{bmatrix} p & Y \end{bmatrix}, \qquad D = \begin{bmatrix} 1 & \\ & D_{N-1} \end{bmatrix}, \qquad T^{-1} = \begin{bmatrix} \mathbb{1}^T \\ R^T \end{bmatrix} \tag{40}$$
We can now study the convergence rate of the transient and asymptotic terms in (35). The following result shows
that the transient term can die out at different rates depending on the value of $2\lambda_{\min}\mu$.

Theorem 2 (Convergence rate of transient term): Let Assumptions 1-3 hold. The transient term in (35) decays
asymptotically according to

$$\mathbb{E}\|\tilde{\boldsymbol{w}}_{0}\|^2_{B_1^T\cdots B_{i-1}^T\,\Sigma\,B_{i-1}\cdots B_1} = \Theta\left(i^{-2\lambda_{\min}\mu}\right) \quad \text{as } i \to \infty$$

where the notation $f(n) = \Theta(g(n))$ implies that the sequence $f(n)$ decays at the same rate as $g(n)$ for sufficiently
large $n$ (i.e., there exist positive constants $c_1$, $c_2$ and integer $n_0$ such that $c_1 g(n) \le f(n) \le c_2 g(n)$ for all $n > n_0$).
The constant $\lambda_{\min} = \min\{\lambda_1, \ldots, \lambda_M\}$ is the smallest eigenvalue of $\nabla_w^2 J(w^o)$.

Proof: See Appendix A-B. ■



Actually, in App. A-B we derive upper and lower bounds for the transient term (see (88) and (89)). Examining
these bounds, we notice that the transient term will grow initially before it starts to decay. The asymptotic rate
of decay of the transient error is on the order of $i^{-2\lambda_{\min}\mu}$ for any value of $2\lambda_{\min}\mu$. We will see next that when
$2\lambda_{\min}\mu > 1$, the transient term is not dominant since the asymptotic term converges at a slower rate.

We can also examine the asymptotic convergence rate of the second term on the right-hand side of (35). This
term is in fact the slowest converging term in all cases except when $2\lambda_{\min}\mu < 1$, when the convergence rate will
match that of the transient term.



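Theorem 3 (Convergence rate of asymptotic term): Let Assumptions 1-3 hold. Let the weighting matrix $\Sigma =
\frac{1}{2}E_{kk} \otimes \nabla_w^2 J(w^o)$. It then holds asymptotically that

$$\sum_{j=1}^{i-1}\mu^2(j)\,\mathrm{Tr}\left(B_{j+1}^T\cdots B_{i-1}^T\,\Sigma\,B_{i-1}\cdots B_{j+1}\,\mathcal{A}^T\mathcal{R}_v\mathcal{A}\right) \sim \frac{\mu^2}{2}\sum_{m=1}^{M}\lambda_m\alpha_m(i)\left(\Phi^T R_v\Phi\right)_{mm}\cdot\|p\|_2^2 \tag{41}$$

where the notation $f(i) \sim g(i)$ implies that $\lim_{i\to\infty} f(i)/g(i) = 1$ and $\alpha_m(i)$ is defined as

$$\alpha_m(i) \triangleq \begin{cases} \dfrac{1}{2\lambda_m\mu - 1}\cdot\delta_m(i), & 2\lambda_m\mu > 1 \\[2mm] \delta_m(i), & 2\lambda_m\mu = 1 \\[2mm] \dfrac{{}_3F_2(1,1,1;\,2-\lambda_m\mu,\,2-\lambda_m\mu;\,1)}{\Gamma^2(2-\lambda_m\mu)}\cdot\delta_m(i), & 2\lambda_m\mu < 1 \end{cases}, \qquad \delta_m(i) \triangleq \begin{cases} \dfrac{1}{i}, & 2\lambda_m\mu > 1 \\[2mm] \dfrac{\log(i)}{i}, & 2\lambda_m\mu = 1 \\[2mm] \dfrac{1}{i^{2\lambda_m\mu}}, & 2\lambda_m\mu < 1 \end{cases} \tag{42}$$

where $\Gamma(\cdot)$, ${}_3F_2(a_1, a_2, a_3; b_1, b_2; z)$ are the Gamma and generalized hypergeometric functions [30, pp. 892, 1010],
respectively, and $\lambda_m$ is the $m$-th eigenvalue of $\nabla_w^2 J(w^o)$. Moreover, the notation $(X)_{mm}$ denotes the $m$-th diagonal
element of the matrix $X$. Furthermore, with probability at least $1 - \epsilon$, we have that

$$J(\boldsymbol{w}_{k,i-1}) - J(w^o) \le \frac{\mu^2}{2\epsilon}\left(\sum_{m=1}^{M}\lambda_m\alpha_m(i)\left(\Phi^T R_v\Phi\right)_{mm}\right)\cdot\|p\|_2^2, \quad \text{as } i \to \infty$$

when $2\lambda_{\min}\mu > 1$, and $\epsilon > 0$ is a small positive number.

Proof: See Appendix A-C further ahead. ■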
Theorem 3 establishes a closed-form expression for the asymptotic excess-risk of the diffusion algorithm. We observe
that the slowest rate at which the asymptotic term converges depends on the smallest eigenvalue of $\nabla_w^2 J(w^o)$ and
the constant $\mu$. When $2\lambda_{\min}\mu = 1$, the asymptotic term will converge at the rate $\log(i)/i$. When $2\lambda_{\min}\mu < 1$, the
asymptotic term will converge at the rate $i^{-2\lambda_{\min}\mu}$, which can be arbitrarily slow. Table I lists the convergence rates
of the transient and asymptotic terms from (35).
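The three regimes can be explored numerically with a scalar caricature of the asymptotic term. The sketch below is our own simplification (a single node, a single eigenvalue `lam`, unit gradient-noise variance, and $\mu(i) = \mu/i$), not the general expression of the theorem: it accumulates $\sum_{j=1}^{n}\mu^2(j)\prod_{t=j+1}^{n}(1-\lambda\mu(t))^2$ through the equivalent one-step recursion.

```python
def noise_accumulation(lam, mu, n):
    """Scalar analogue of the asymptotic term after n iterations.

    Implements v_t = (1 - lam * mu / t)^2 * v_{t-1} + (mu / t)^2 with v_0 = 0,
    which unfolds to sum_j (mu/j)^2 * prod_{t>j} (1 - lam * mu / t)^2.
    """
    v = 0.0
    for t in range(1, n + 1):
        step = mu / t
        v = (1.0 - lam * step) ** 2 * v + step * step
    return v

def predicted_fast_rate(lam, mu, n):
    """Leading-order prediction mu^2 / ((2*lam*mu - 1) * n), valid for 2*lam*mu > 1."""
    return mu * mu / ((2.0 * lam * mu - 1.0) * n)
```

For `lam = 1.0` and `mu = 1.0` the recursion and the prediction coincide. For `mu = 0.25` (so that $2\lambda\mu < 1$), the product `n * noise_accumulation(1.0, 0.25, n)` keeps growing with `n`, which reflects a decay slower than $1/n$.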
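TABLE I

Listing of the convergence rates of the transient and asymptotic terms in (35) under different conditions on $2\lambda_{\min}\mu$.
The shaded cells indicate the slowest converging term.

Condition                    Transient Term                        Asymptotic Term
$2\lambda_{\min}\mu > 1$     $\Theta(i^{-2\lambda_{\min}\mu})$     $\Theta(i^{-1})$
$2\lambda_{\min}\mu = 1$     $\Theta(i^{-1})$                      $\Theta(\log(i)/i)$
$2\lambda_{\min}\mu < 1$     $\Theta(i^{-2\lambda_{\min}\mu})$     $\Theta(i^{-2\lambda_{\min}\mu})$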

We see that in all cases, the asymptotic term is dominant, and its rate is only matched when $2\lambda_{\min}\mu < 1$. The
shaded cells indicate the slowest converging term under each condition on $2\lambda_{\min}\mu$. Clearly, it is best to choose a
larger value for $\mu$ to satisfy $2\lambda_{\min}\mu > 1$ in order to attain the fast convergence rate of $1/i$ asymptotically. When
$\lambda_{\min}$ is not known, and thus it is not clear how to choose $\mu$ to satisfy $2\lambda_{\min}\mu > 1$, it is common to choose a large
$\mu$ that forces $2\lambda_{\min}\mu \gg 1$. In this case, we get

$$\mathrm{ER}_k(i) \approx \frac{\mu}{4i}\,\mathrm{Tr}(R_v)\cdot\|p\|_2^2$$

This approximation is close in form to the steady-state performance derived for the diffusion algorithm when a
constant step-size $\mu$ is used [31]. The main difference is that the "steady-state" term will diminish at the rate $1/i$
when $\mu(i) = \mu/i$ and $2\lambda_{\min}\mu \gg 1$. Moreover, under some mild conditions on the distribution of the gradient noise,
it is possible to show that the diffusion strategy achieves the Cramer-Rao bound, $\Omega(1/(Ni))$, asymptotically, up to
some constant not dependent on $N$ or $i$ (see Sec. IV further ahead). Finally, since the expected excess-risk agrees
with the asymptotic term (since the asymptotic term is the slowest decaying term) when $2\lambda_{\min}\mu > 1$, we have
therefore shown that the expected excess-risk decays according to (41) when $2\lambda_{\min}\mu > 1$. We further strengthened
this statement to state that the excess-risk itself, not just on average, decays at this rate with high probability.

By specializing the previous results to the case $N = 1$ (a stand-alone node), we obtain as a corollary the following
result for the expected excess-risk that is delivered by the traditional stochastic gradient algorithm (also known 
as Robbins-Monro stochastic approximation). This result establishes a closed-form expression for the asymptotic 
excess-risk performance of stochastic gradient descent. 

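Corollary 1 (Stochastic gradient approximation): Let $N = 1$ in (9)-(10). Then, the algorithm reduces to the
following stand-alone stochastic gradient descent recursion:

$$\boldsymbol{w}_i = \boldsymbol{w}_{i-1} - \mu(i)\widehat{\nabla_w J}(\boldsymbol{w}_{i-1})$$

Furthermore, let Assumptions 1-2 hold with $\mu(i) = \mu/i$. Then, the excess-risk asymptotically satisfies:

$$\mathbb{E}_w\{J(\boldsymbol{w}_{i-1}) - J(w^o)\} \sim \Theta\left(i^{-2\lambda_{\min}\mu}\right) + \frac{\mu^2}{2}\sum_{m=1}^{M}\lambda_m\alpha_m(i)\left(\Phi^T R_v\Phi\right)_{mm} \tag{44}$$

where the notation $f(i) \sim g(i)$ implies that $\lim_{i\to\infty} f(i)/g(i) = 1$ and $\alpha_m(i)$ is defined as

$$\alpha_m(i) \triangleq \begin{cases} \dfrac{1}{2\lambda_m\mu - 1}\cdot\dfrac{1}{i}, & 2\lambda_m\mu > 1 \\[2mm] \dfrac{\log(i)}{i}, & 2\lambda_m\mu = 1 \\[2mm] \dfrac{{}_3F_2(1,1,1;\,2-\lambda_m\mu,\,2-\lambda_m\mu;\,1)}{\Gamma^2(2-\lambda_m\mu)}\cdot\dfrac{1}{i^{2\lambda_m\mu}}, & 2\lambda_m\mu < 1 \end{cases} \tag{45}$$

and the bounds for $\Theta(i^{-2\lambda_{\min}\mu})$ are given by (88)-(89) using $p = 1$ and $A = 1$. ■
Furthermore, we can show that the centralized algorithm (8) achieves an $N$-fold improvement in convergence speed
in comparison to the non-cooperative algorithm (7):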

Corollary 2 (Centralized processing): Let Assumptions 1-2 hold with $\mu(i) = \mu/i$. Consider the centralized
algorithm (8), which has access to all the data across the network at each iteration. Then, the excess-risk asymptotically
satisfies:

$$\mathbb{E}_w\{J(\boldsymbol{w}_{i-1}) - J(w^o)\} \sim \Theta\left(i^{-2\lambda_{\min}\mu}\right) + \frac{\mu^2}{2}\cdot\frac{1}{N}\sum_{m=1}^{M}\lambda_m\alpha_m(i)\left(\Phi^T R_v\Phi\right)_{mm} \tag{46}$$

where the notation $f(i) \sim g(i)$ implies that $\lim_{i\to\infty} f(i)/g(i) = 1$ and $\alpha_m(i)$ is defined as in (45). Specifically, the algorithm
converges at the rate $\Theta(1/(Ni))$ when $2\mu\lambda_{\min} > 1$.

Proof: We substitute the gradient approximation model from Assumption 2 into (8):

$$w_i = w_{i-1} - \frac{\mu(i)}{N}\sum_{k=1}^{N}\left[\nabla_w J(w_{i-1}) + v_{k,i}(w_{i-1})\right] = w_{i-1} - \mu(i)\left(\nabla_w J(w_{i-1}) + q_i(w_{i-1})\right) \qquad (47)$$

where

$$q_i(w_{i-1}) \triangleq \frac{1}{N}\sum_{k=1}^{N} v_{k,i}(w_{i-1})$$

is the effective noise. Since the algorithm converges asymptotically and the estimate converges $w_{i-1} \to w^o$ almost surely, the covariance matrix of the noise is asymptotically characterized by

$$R_q = \mathbb{E}\{q_i(w^o)\,q_i(w^o)^T\} = \frac{1}{N} R_v, \quad \text{as } i \to \infty \qquad (48)$$

Due to the correspondence between (47) and (7), we see that the only difference between the non-cooperative algorithm (7) and the centralized algorithm (8) is that the asymptotic noise covariance of the centralized algorithm, $R_q$, is that of the non-cooperative algorithm, $R_v$, scaled by a factor of $1/N$. Therefore, we can use Corollary 1 with (48) in place of $R_v$ to obtain the desired result. ■
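The $1/N$ scaling of the effective noise covariance used in the proof above can be checked numerically. The sketch below is only an illustration under the simplifying assumptions of scalar, zero-mean Gaussian gradient noise that is independent across nodes; the function name and parameters are hypothetical.

```python
import random
import statistics

def averaged_noise_variance(n_nodes, n_trials=20000, sigma=1.0, seed=0):
    """Empirical variance of q = (1/N) * sum_k v_k for independent v_k ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    samples = [
        sum(rng.gauss(0.0, sigma) for _ in range(n_nodes)) / n_nodes
        for _ in range(n_trials)
    ]
    return statistics.pvariance(samples)
```

For `n_nodes = 10` and `sigma = 1` the returned value should be close to $\sigma^2/N = 0.1$, which is the scalar analogue of $R_q = R_v/N$.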

IV. Benefit of Cooperation 

Up to this point in the discussion, the benefit of cooperation has not yet manifested itself; this benefit is actually encoded in the vector $p$. Optimization over $p$ will help bring forth these advantages. Thus, observe that the expression for the asymptotic term in Theorem 3 is quadratic in $p$. We can optimize the asymptotic expression over $p$ in order to speed up the convergence rate. Observe from expressions (41)-(42) that, in each of the three cases qualified by the value assumed by $2\lambda_m\mu$, the slowest rate of convergence of (41) to zero is dictated by the value of $a_m(i)$ that corresponds to $\lambda_{\min}$. We denote the value of $m$ that corresponds to $\lambda_{\min}$ by $m_{\min}$ and define



log(») 



2AminM > 1 
2Ainin/i = 1 
2Aniin/i < 1 



as well as 



E 

{m:6jn{i)—6rr, 



r. 



(49) 



Then we consider the problem of optimizing this slowest mode of convergence, namely, 



mm 



subject to $Ap = p$, $\mathbb{1}^T p = 1$, $p > 0$

where $\mathcal{A}$ denotes the set of left-stochastic and primitive combination matrices $A$ that satisfy the network topology structure. It is generally not clear how to solve this optimization problem over both $A$ and $p$. We pursue an indirect route. We first remove the optimization over $A$ and determine an optimal $p$. Subsequently, given the optimal $p^o$, we show that a left-stochastic and primitive matrix $A$ can be constructed that satisfies the network topology. The equivalent relaxed problem is:



mm 

p 



Ml 



subject to $\mathbb{1}^T p = 1$, $p > 0$

The above optimization problem is convex, and its solution is given by

$$p^o = \frac{1}{N}\mathbb{1}_N \qquad (50)$$
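The optimality of the uniform vector can be illustrated numerically. The sketch below assumes that the quantity being minimized is proportional to $\|p\|_2^2$, as suggested by the quadratic dependence of the asymptotic term on $p$; it then samples random points of the probability simplex and checks that none has a smaller squared norm than the uniform vector. All function names are hypothetical.

```python
import random

def squared_norm(p):
    """Squared Euclidean norm of the weight vector p."""
    return sum(x * x for x in p)

def random_simplex_point(n, rng):
    """A random vector with strictly positive entries that sum to one."""
    raw = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

def uniform_is_minimizer(n, n_trials=1000, seed=0):
    """Check that no sampled simplex point beats p = (1/N) * ones."""
    rng = random.Random(seed)
    best = squared_norm([1.0 / n] * n)
    return all(
        squared_norm(random_simplex_point(n, rng)) >= best - 1e-12
        for _ in range(n_trials)
    )
```

The minimum value is $\|p^o\|_2^2 = 1/N$, which is the source of the $N$-fold gain discussed in this section.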



A combination matrix A that has this p° as the right-eigenvector associated with the eigenvalue at one is the 
following Metropolis rule p2|, p3|, which corresponds to a doubly-stochastic and symmetric combination matrix: 



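$$a_{\ell k} = \begin{cases} \min\left\{\dfrac{1}{|\mathcal{N}_\ell|},\ \dfrac{1}{|\mathcal{N}_k|}\right\}, & \ell \in \mathcal{N}_k \setminus \{k\} \\[2mm] 1 - \sum_{j \in \mathcal{N}_k \setminus \{k\}} a_{jk}, & \ell = k \\[2mm] 0, & \text{otherwise} \end{cases} \qquad (51)$$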
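A minimal sketch of how Metropolis weights of this kind could be built for a given topology is shown below. It assumes that each neighborhood $\mathcal{N}_k$ includes node $k$ itself and that the graph is undirected; the function name and the adjacency representation are illustrative choices rather than anything prescribed by the text.

```python
def metropolis_weights(neighbors):
    """Build a Metropolis combination matrix A with entries A[l][k] = a_{lk}.

    neighbors[k] is the set of nodes linked to node k, INCLUDING k itself
    (an assumption of this sketch). The graph is assumed to be undirected.
    """
    n = len(neighbors)
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in neighbors[k]:
            if l != k:
                A[l][k] = min(1.0 / len(neighbors[l]), 1.0 / len(neighbors[k]))
        # The self-weight absorbs whatever is left so that each column sums to one.
        A[k][k] = 1.0 - sum(A[l][k] for l in neighbors[k] if l != k)
    return A
```

For a path graph with neighborhoods `[{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}]`, the resulting matrix is symmetric with unit row and column sums, so the uniform vector is its eigenvector for the eigenvalue at one.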



To see the effectiveness of this choice for $p$, we substitute $p^o$ from (50) into (41) to find

g^( n Bj]J n Bj 



V 




Amftm 



(52) 



Comparing the above result with the rightmost term in (44), we observe an $N$-fold improvement in the convergence rate when $2\lambda_{\min}\mu \ge 1$ (since the transient term will contribute to the convergence rate only when $2\lambda_{\min}\mu < 1$). Actually, the right-hand side of (52) is the performance attained by the centralized stochastic-gradient algorithm (8) when $2\lambda_{\min}\mu > 1$ (see (46)). This implies that the diffusion algorithm will asymptotically achieve the same performance as (8) when $2\lambda_{\min}\mu > 1$. Specifically, when $2\lambda_{\min}\mu > 1$, the diffusion algorithm achieves an asymptotic convergence rate of $\Theta(1/(Ni))$.

We remark that this result is interesting because it is asymptotically the fastest convergence rate attainable when the nodes sample the data in an i.i.d. fashion from a common distribution $X$. To see this, consider a causal, potentially centralized and non-recursive, learner with access to the information $\{x_{k,j}\}$ where

$$x_{k,j} \overset{\text{i.i.d.}}{\sim} X$$

for $k = 1, \ldots, N$ and the vectors $\{x_{k,j}\}$ depend on the fixed parameter vector $w^o$ in some fashion. It is well known that the efficiency of the estimator $w_i$ for $w^o$, derived from $\{x_{k,j},\ j \le i\}$, can be characterized by the Cramer-Rao lower-bound

$$\mathbb{E}\{\tilde{w}_i \tilde{w}_i^T\} \ge [\mathrm{FIM}_{\mathrm{Network}}(w^o)]^{-1} \qquad (53)$$

where $\tilde{w}_i \triangleq w^o - w_i$, assuming that $\nabla_{w^o}\, p(x_{k,i}; w^o)$ exists for all $w^o \in \mathbb{R}^M$ and $x_{k,i}$ is in the support of $p(x_{k,i}; w^o)$, where $p(x_{k,i}; w^o)$ represents the likelihood function of $x_{k,i}$. Result (53) holds when the following regularity condition [34, p. 170]

$$\mathbb{E}\{\nabla_{w^o} \log p(x_{k,i}; w^o)\} = 0 \qquad (54)$$

holds for all $w^o$, and the Fisher information matrix $\mathrm{FIM}_{\mathrm{Network}}(w^o)$ is positive-definite. The network Fisher information matrix $\mathrm{FIM}_{\mathrm{Network}}(w^o)$ that appears in (53) is defined as p5| p. 44]:



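$$\begin{aligned} \mathrm{FIM}_{\mathrm{Network}}(w^o) &\triangleq -\mathbb{E}\left\{\nabla^2_{w^o} \log p(x_{1,1}, \ldots, x_{N,i}; w^o)\right\} \\ &= -\mathbb{E}\left\{\nabla^2_{w^o} \log \prod_{k=1}^{N}\prod_{j=1}^{i} p(x_{k,j}; w^o)\right\} \\ &= \sum_{k=1}^{N}\sum_{j=1}^{i} -\mathbb{E}\left\{\nabla^2_{w^o} \log p(x_{k,j}; w^o)\right\} \\ &= N i \cdot \mathrm{FIM}_{\mathrm{sample}}(w^o) \end{aligned} \qquad (55)$$

where $\mathrm{FIM}_{\mathrm{sample}}(w^o)$ is the Fisher information matrix of a single sample, defined by:

$$\mathrm{FIM}_{\mathrm{sample}}(w^o) \triangleq -\mathbb{E}\left\{\nabla^2_{w^o} \log p(x; w^o)\right\} \qquad (56)$$

where $x \sim X$. Observe that the network Fisher information matrix $\mathrm{FIM}_{\mathrm{Network}}(w^o)$ will be positive-definite so long as the sample Fisher information matrix (56) is positive-definite. Substituting (55)-(56) into (53), we have: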



$$\mathbb{E}\{\tilde{w}_i \tilde{w}_i^T\} \ge \frac{1}{Ni}\,[\mathrm{FIM}_{\mathrm{sample}}(w^o)]^{-1} \qquad (57)$$

Inequality (57) only gives a lower-bound on the mean-square-deviation from the optimizer $w^o$. In order to transform this result into a lower bound for the convergence rate of the excess-risk, we recall from (18) that the excess-risk can be represented as a weighted mean-square-error. Recalling Assumption 1, we obtain

$$\mathbb{E}\{J(w_i) - J(w^o)\} \ge \frac{\lambda_{\min}}{2}\cdot\mathbb{E}\|\tilde{w}_i\|^2 \ge \frac{\lambda_{\min}}{2}\cdot\frac{1}{Ni}\cdot\mathrm{Tr}\left([\mathrm{FIM}_{\mathrm{sample}}(w^o)]^{-1}\right) \qquad (58)$$

This result implies that no algorithm can achieve a convergence rate in expected excess-risk or mean-square-deviation faster than $\Omega(1/(Ni))$, where the notation $f(i) = \Omega(g(i))$ implies that there exist a positive constant $c$ and an integer $i_0$ such that $c \cdot g(i) \le f(i)$ for all $i > i_0$. Since we showed in this section that the diffusion algorithm asymptotically achieves the rate $\Theta(1/(Ni))$, this means that the fully distributed diffusion algorithm attains an asymptotically
efficient rate up to a constant that does not depend on N or i. No algorithm, not even a Newton's method based 
algorithm, can achieve a faster rate aside from a constant that does not depend on i or N. The work in p2) showed 
a similar result by using a Newton-type consensus algorithm under a linear observation model. Therefore, from a 
convergence-rate perspective, gradient-descent constructions are asymptotically as efficient up to a constant factor. 
This result is consistent with the work by [23], which shows (under stronger assumptions such as Lipschitz risk functions) that no algorithm can converge faster than $1/T$ for $T$ samples. If we apply this result to a centralized algorithm that has access to all the i.i.d. data $\{(y_{1,i}, h_{1,i}), \ldots, (y_{N,i}, h_{N,i})\}$, then we conclude that no algorithm can converge faster than $1/(Ni)$ under the assumptions in [23]. The discussion in the next section takes
this comparison further and shows that there are some disadvantages to using consensus-type constructions |j2J, 
[12], [36] in lieu of diffusion-type constructions in the context of learning tasks. In particular, we will show that
the expected excess-risk of the diffusion-type algorithm will always be upper-bounded by the performance of a 
consensus-type gradient-descent algorithm. Furthermore, we will observe that the transient performance may be 
significantly worse for consensus implementations than diffusion implementations. 
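To make the bound (57) concrete, consider the textbook case in which every sample is Gaussian, $x \sim \mathcal{N}(w^o, \sigma^2 I_M)$. In that case $\mathrm{FIM}_{\mathrm{sample}}(w^o) = I_M/\sigma^2$, so the trace of the right-hand side of (57) is $M\sigma^2/(Ni)$, and the sample mean over all $Ni$ observations attains it. The sketch below is an illustration under this Gaussian assumption (function names and parameter values are hypothetical) that compares a Monte-Carlo estimate of the mean-square-deviation of the sample mean with that value.

```python
import random

def crb_trace(n_nodes, n_iters, dim, sigma):
    """Trace of the bound (57) for x ~ N(w_o, sigma^2 I): M * sigma^2 / (N * i)."""
    return dim * sigma ** 2 / (n_nodes * n_iters)

def sample_mean_mse(n_nodes, n_iters, dim, sigma, n_trials=2000, seed=0):
    """Monte-Carlo mean-square-deviation of the sample mean over N * i samples.

    The true parameter is taken as the zero vector without loss of generality.
    """
    rng = random.Random(seed)
    count = n_nodes * n_iters
    total = 0.0
    for _ in range(n_trials):
        for _ in range(dim):
            mean_err = sum(rng.gauss(0.0, sigma) for _ in range(count)) / count
            total += mean_err ** 2
    return total / n_trials
```

Doubling either `n_nodes` or `n_iters` halves both quantities, which is the $1/(Ni)$ behavior that the diffusion strategy matches up to a constant.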

V. Comparison to Consensus Strategies 

In this section, we show that the diffusion strategy has several advantages over the consensus strategy (11), which
is commonly used in machine learning applications Q, p7) , while retaining the same computational complexity 
p5| . Specifically, we will analytically show that, asymptotically, the consensus excess-risk curve is worse than the 
diffusion excess-risk curve. In addition, we will show through simulation that the overshoot during the transient 
phase may be significantly worse for consensus implementations. 

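The main difference between the dynamics of the diffusion and consensus implementations is in the definition of the $B_t$ matrices in (32) where

$$B_t^{\mathrm{diff}} \triangleq (A^T \otimes I_M)\left(I_{MN} - \mu(t)\,(I_N \otimes \nabla_w^2 J(w^o))\right) \qquad (59)$$

$$B_t^{\mathrm{cons}} \triangleq (A^T \otimes I_M) - \mu(t)\,(I_N \otimes \nabla_w^2 J(w^o)) \qquad (60)$$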

The seemingly small change in the order in which operations take place within the consensus and diffusion strategies leads to significant differences in the dynamics of the evolution of the error vectors over the networks, leading to worse transient and asymptotic performance for consensus strategies; these conclusions are consistent with results for mean-square-error estimation p5]. To examine these differences in behavior, we will consider the average network excess-risk:



1 ^ 



N 

k = l 

When $2\lambda_{\min}\mu \ge 1$, we can average the asymptotic terms from (35) to obtain the following expression for the asymptotic network excess-risk (attained by substituting (37) into the second term of (35)) p9j

^ ^E'^'(^)Tr (/® V^K)) ( n a ( n , as z -> oo (61) 

j=i \ \t=j+i / \t=j+i J J 

where $B_t$ is either $B_t^{\mathrm{diff}}$ or $B_t^{\mathrm{cons}}$, depending on which algorithm we wish to examine. Likewise, the matrix $Q$ is either $Q^{\mathrm{diff}}$ or $Q^{\mathrm{cons}}$, depending on which algorithm we wish to examine.

In this section, we will assume that the matrix $A$ is symmetric and thus diagonalizable. This assumption is reasonable since we already showed in Sec. IV that the Metropolis rule optimizes the convergence rate and leads to a symmetric $A$.

Assumption 4 (Symmetric A): The combination matrix $A$ is symmetric. ■

This assumption makes $A$ diagonalizable and therefore the matrix $D$ in its Jordan canonical factorization (38) will now be a diagonal matrix. Extending the argument from p5], we introduce the vectors $r_k$ and $y_k$ for $k = 1, \ldots, N$ that represent the right and left eigenvectors, respectively, of the matrix $A^T$ corresponding to eigenvalue $D_{kk}$ (the $k$-th diagonal element of the matrix $D$ in the decomposition of $A$ in (40)):

$$A^T r_k = D_{kk}\, r_k, \qquad y_k^T A^T = D_{kk}\, y_k^T$$

where we normalize the vectors so that $r_k^T y_k = 1$. When $A$ is doubly-stochastic, then $D_{11} = 1$, $r_1 = \mathbb{1}_N$, and $y_1 = \frac{1}{N}\mathbb{1}_N$. Furthermore, let $\{s_m,\ m = 1, \ldots, M\}$ denote the eigenvectors of the matrix $\nabla_w^2 J(w^o)$:

$$\nabla_w^2 J(w^o)\, s_m = \lambda_m s_m$$

where $\lambda_m$ is the $m$-th eigenvalue of $\nabla_w^2 J(w^o)$ and the eigenvectors are normalized so that $\|s_m\|_2^2 = 1$. We notice that the matrices $B_t^{\mathrm{diff}}$ and $B_t^{\mathrm{cons}}$ share the same eigenvectors (but have different eigenvalues). We denote the
eigenvectors of Bi by 2/f . They can be found to be: 

We summarize the variables in Table II. We can now establish the following result.

TABLE II

Variables in the Diffusion and Consensus Implementations.



Algorithm


Diffusion (9)-(10)


Consensus (11)












Dkk ~ tJ'{i)^m 


Q 










llSmllijIj/felP 



TABLE III 

Simulation parameters for quadratic risk minimization 



N 


AI 


a 








•^max 


20 


2 


12 


1 


1.5 


2 


2 



Theorem 4 (Comparing network excess-risks): Let Assumptions 1-4 hold and $\mu >$ . Then, the asymptotic expected network excess-risk achieved by the diffusion strategy (9)-(10) is upper bounded by that achieved by the consensus strategy (11):

$$\mathrm{ER}^{\mathrm{diff}}(i) \le \mathrm{ER}^{\mathrm{cons}}(i) \quad \text{as } i \to \infty \qquad (62)$$

Proof: See Appendix A-D.



Result (62) implies that, asymptotically, the curve for the expected network excess-risk for the diffusion algorithm will be upper-bounded by the curve for the consensus algorithm. Even though both algorithms have the same computational complexity, the diffusion algorithm achieves better performance because it succeeds at diffusing the information more thoroughly through the network.

VI. Illustration of Results 

In order to illustrate our results, we consider two situations. First, we optimize the quadratic loss (3) with a linear model and synthetic Gaussian-distributed data. Second, we optimize a regularized logistic loss (2) over data
generated from the "alpha" dataset p8). 



A. Quadratic Risk 

Consider a mean-square-error risk function: 

$$J(w) = \mathbb{E}\,(y_k(i) - h_{k,i}^T w)^2 \qquad (63)$$

and assume the simulation parameters shown in Table III. Metropolis weights are used for combining estimates. The quadratic loss function optimized at each node is (3), where $h_{k,i}$ is a random vector in $\mathbb{R}^{M \times 1}$ and is a Gaussian random vector with i.i.d. elements and zero mean and unit variance. In addition, the scalar observation $y_k(i)$ is generated as $y_k(i) = h_{k,i}^T w^o + v_k(i)$, where $v_k(i)$ is a Gaussian random variable with zero mean and unit variance.

Fig. 1. Comparison between learning curves of non-cooperative processing (7), diffusion algorithm (9)-(10), and consensus-type algorithm (11) for quadratic loss minimization. The simulation parameters are listed in Table III.

The curves illustrate that the difference in performance between non-cooperative processing (7) and the diffusion algorithm presented in Section IV is about 13 dB ($10\log_{10}(N)$). We also observe that (9)-(10) achieves 10 dB per decade decay in simulation. In comparison to the consensus algorithm (11), the diffusion algorithm (9)-(10) is seen to have better transient performance, and the expected network excess-risk for the consensus strategy remains higher than that for diffusion (as predicted by Theorem 4).
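The qualitative gap between non-cooperative and diffusion learning can be reproduced with a small simulation. The sketch below is not the experiment reported here: it assumes a ring topology with Metropolis weights, the loss $\frac{1}{2}(y - h^T w)^2$ (so that the Hessian is the identity and $2\lambda_{\min}\mu = 2$ for $\mu = 1$), the step-size $\mu(i) = \mu/i$, and an adapt-then-combine form of the diffusion update; all function names and default values are hypothetical.

```python
import random

def ring_metropolis(n):
    """Metropolis weights for a ring in which each node is linked to itself
    and to its two adjacent nodes, so every neighborhood has size 3 (n >= 3)."""
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in (k - 1, k, k + 1):
            A[l % n][k] = 1.0 / 3.0
    return A

def sgd_step(w, h, y, step):
    """One stochastic-gradient step on the loss 0.5 * (y - h^T w)^2."""
    err = y - sum(hm * wm for hm, wm in zip(h, w))
    return [wm + step * err * hm for wm, hm in zip(w, h)]

def simulate(n_nodes=20, dim=2, mu=1.0, n_iters=2000, seed=0):
    """Return (non-cooperative, diffusion) network-average excess-risk at the end."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    A = ring_metropolis(n_nodes)
    w_nc = [[0.0] * dim for _ in range(n_nodes)]
    w_df = [[0.0] * dim for _ in range(n_nodes)]
    for i in range(1, n_iters + 1):
        step = mu / i
        psi = []
        for k in range(n_nodes):
            h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            y = sum(hm * wm for hm, wm in zip(h, w_true)) + rng.gauss(0.0, 1.0)
            w_nc[k] = sgd_step(w_nc[k], h, y, step)       # stand-alone learner
            psi.append(sgd_step(w_df[k], h, y, step))     # adaptation step
        # combination step: w_k = sum_l a_{lk} * psi_l
        w_df = [[sum(A[l][k] * psi[l][m] for l in range(n_nodes))
                 for m in range(dim)] for k in range(n_nodes)]

    def avg_excess(ws):
        # For unit-covariance regressors the excess-risk is 0.5 * ||w - w_true||^2.
        return sum(0.5 * sum((wm - tm) ** 2 for wm, tm in zip(w, w_true))
                   for w in ws) / n_nodes

    return avg_excess(w_nc), avg_excess(w_df)
```

Averaging the two returned values over several seeds and plotting them against the iteration index on a logarithmic scale would give curves comparable in spirit to Fig. 1, with a gap of roughly $10\log_{10}(N)$ dB expected for large $i$.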



B. Regularized Logistic Regression 

We next consider the minimization of the regularized logistic regression loss function (2). We draw the feature and label data $\{h_{k,i}, y_{k,i}\}$ randomly from the "alpha" dataset p8). The optimal vector $w^o$ is computed using deterministic gradient descent on the empirical risk computed via:

$$J_{\mathrm{emp}}(w) \triangleq \frac{1}{500000}\sum_{n=1}^{500000} Q(h_n, y_n, w) \qquad (64)$$

where $\{h_n, y_n\}$ are the $n$-th feature vector and label from the dataset. The simulation parameters are listed in Table IV and the simulation results are plotted in Fig. 2. The curves were averaged over 100 experiments and the Metropolis rule (51) is used to combine the estimates at each node. Notice that the difference in performance between



non-cooperative processing (7) and the diffusion algorithm presented in Section IV is about 10 dB ($10\log_{10}(N)$). The excess-risk associated with the consensus strategy (11) remains higher than that for diffusion (as predicted by Theorem 4), and the diffusion algorithm asymptotically achieves the performance of the centralized algorithm (8) as predicted by our analysis.

TABLE IV

Simulation parameters for regularized logistic regression simulation



| N  | M   | Tr(R_v) |      | P  |
| 10 | 500 | 1028.18 | 0.20 | 10 |




[Figure: learning curves; horizontal axis $\log_{10}(i)$, ranging from 2 to 4.]

Fig. 2. Comparison between learning curves of the non-cooperative processing (7), centralized algorithm (8), diffusion algorithm (9)-(10), and consensus algorithm (11) for regularized logistic regression. The simulation parameters are listed in Table IV.

VII. Conclusions 

We proposed a fully distributed algorithm for the optimization of a strongly-convex risk function. We studied its performance and established that the algorithm's excess-risk can converge at the rate $\Theta(1/(Ni))$ asymptotically, thereby matching the Cramer-Rao bound up to a constant that does not depend on either $N$ or $i$. This is in contrast to the convergence rate attained when no cooperation takes place between the nodes ($\Theta(1/i)$). The algorithm's excess-risk performance matches that of the centralized algorithm (8). Each node in the network also converges at the asymptotically optimal rate through local interactions. We also showed that the diffusion algorithm outperforms consensus-type algorithms.

Appendix A 
Proof of Theorems 

In this appendix, we provide proofs for the various results demonstrated in the manuscript. The proofs of lemmas required for these derivations are provided in App. B if not directly referenced.

A. Proof of Theorem 1

We follow the approach of ||9J and extend it to handle diminishing step-sizes as well. We define the error vectors 



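at node $k$ at time $i$ as:

$$\tilde{\psi}_{k,i} \triangleq w^o - \psi_{k,i} \qquad (65)$$

$$\tilde{w}_{k,i} \triangleq w^o - w_{k,i} \qquad (66)$$

We subtract (9)-(10) from $w^o$ using ([TSj to get

$$\tilde{\psi}_{k,i} = \tilde{w}_{k,i-1} + \mu(i)\cdot\left[\nabla_w J(w_{k,i-1}) + v_{k,i}(w_{k,i-1})\right] \qquad (67)$$

$$\tilde{w}_{k,i} = \sum_{\ell=1}^{N} a_{\ell k}\,\tilde{\psi}_{\ell,i} \qquad (68)$$

Using the mean-value theorem for real vectors, we can express the gradient $\nabla_w J(w_{k,i-1})$ in terms of $\tilde{w}_{k,i-1}$: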



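$$\nabla_w J(w_{k,i-1}) = \nabla_w J(w^o) - \left[\int_0^1 \nabla_w^2 J(w^o - t\,\tilde{w}_{k,i-1})\,dt\right]\tilde{w}_{k,i-1} = \nabla_w J(w^o) - H_{k,i-1}\,\tilde{w}_{k,i-1} \qquad (69)$$

where

$$H_{k,i-1} \triangleq \int_0^1 \nabla_w^2 J(w^o - t\,\tilde{w}_{k,i-1})\,dt$$

Notice that $\nabla_w J(w^o) = 0$ since $w^o$ optimizes $J(w)$. Substituting (69) into (67), we get

$$\tilde{\psi}_{k,i} = \left[I - \mu(i) H_{k,i-1}\right]\tilde{w}_{k,i-1} + \mu(i)\,v_{k,i}(w_{k,i-1}) \qquad (70)$$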
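We now derive the mean-square-error (MSE) recursions by noting that $\|x\|^2 = x^T x$ is a convex function of $x$. Therefore, applying Jensen's inequality [39, p. 77] to (68) we get:

$$\mathbb{E}\{\|\tilde{w}_{k,i}\|^2 \mid \mathcal{F}_{i-1}\} \le \sum_{\ell=1}^{N} a_{\ell k}\,\mathbb{E}\{\|\tilde{\psi}_{\ell,i}\|^2 \mid \mathcal{F}_{i-1}\}, \qquad k = 1, \ldots, N \qquad (71)$$

From (70) and using Assumption 2 we obtain

$$\mathbb{E}\{\|\tilde{\psi}_{k,i}\|^2 \mid \mathcal{F}_{i-1}\} = \mathbb{E}\{\|\tilde{w}_{k,i-1}\|^2_{\Sigma_{k,i}} \mid \mathcal{F}_{i-1}\} + \mu^2(i)\cdot\mathbb{E}\{\|v_{k,i}(w_{k,i-1})\|^2 \mid \mathcal{F}_{i-1}\} \qquad (72)$$

where

$$\Sigma_{k,i} \triangleq (I_M - \mu(i) H_{k,i-1})^2$$

The matrices $\Sigma_{k,i}$ can be shown to be positive semi-definite and bounded by:

$$0 \le \Sigma_{k,i} \le \gamma_i^2 I_M \qquad (73)$$

where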

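$$\gamma_i = \max\{|1 - \mu(i)\lambda_{\max}|,\ |1 - \mu(i)\lambda_{\min}|\} \qquad (74)$$

Now note that the square of $\gamma_i$ from (74) can be upper-bounded by:

$$\gamma_i^2 = \max\{1 - 2\mu(i)\lambda_{\max} + \mu^2(i)\lambda_{\max}^2,\ 1 - 2\mu(i)\lambda_{\min} + \mu^2(i)\lambda_{\min}^2\} \le 1 - 2\mu(i)\lambda_{\min} + \mu^2(i)\lambda_{\max}^2$$

In order to simplify the notation in the following analysis, we introduce the upper-bound

$$\beta_i = 1 - 2\mu(i)\lambda_{\min} + \mu^2(i)(\lambda_{\max}^2 + \alpha)$$

where $\alpha$ is defined in Assumption 2. Also, note that by Assumption 2 we have:

$$\mathbb{E}\{\|v_{k,i}(w_{k,i-1})\|^2 \mid \mathcal{F}_{i-1}\} \le \alpha\,\|\tilde{w}_{k,i-1}\|^2 + \sigma_v^2 \qquad (75)$$

Combining (72), (73), and (75), we obtain for $k = 1, \ldots, N$:

$$\mathbb{E}\{\|\tilde{\psi}_{k,i}\|^2 \mid \mathcal{F}_{i-1}\} \le \beta_i\,\|\tilde{w}_{k,i-1}\|^2 + \mu^2(i)\sigma_v^2 \qquad (76)$$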

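1) Global MSE Recursions: We now combine the MSE vectors at each node into global MSE vectors as:

$$\mathcal{W}_i \triangleq \mathrm{col}\{\|\tilde{w}_{1,i}\|^2, \ldots, \|\tilde{w}_{N,i}\|^2\}, \qquad \mathcal{Y}_i \triangleq \mathrm{col}\{\|\tilde{\psi}_{1,i}\|^2, \ldots, \|\tilde{\psi}_{N,i}\|^2\}$$

We can then rewrite (71) and (76) as:

$$\mathbb{E}\{\mathcal{W}_i \mid \mathcal{F}_{i-1}\} \preceq A^T\,\mathbb{E}\{\mathcal{Y}_i \mid \mathcal{F}_{i-1}\}$$

$$\mathbb{E}\{\mathcal{Y}_i \mid \mathcal{F}_{i-1}\} \preceq \beta_i\,\mathcal{W}_{i-1} + \mu^2(i)\sigma_v^2\,\mathbb{1}_N$$

where $x \preceq y$ indicates that each element of the vector $x$ is less than or equal to the corresponding element of the vector $y$. Using the fact that if $x \preceq y$ then $Bx \preceq By$ for any matrix $B$ with non-negative entries, we can combine the above inequality recursions into a single recursion for $\mathcal{W}_i$:

$$\mathbb{E}\{\mathcal{W}_i \mid \mathcal{F}_{i-1}\} \preceq \beta_i A^T \mathcal{W}_{i-1} + \mu^2(i)\sigma_v^2 A^T\mathbb{1}_N = \beta_i A^T \mathcal{W}_{i-1} + \mu^2(i)\sigma_v^2\,\mathbb{1}_N \qquad (77)$$

Now, we multiply both sides by $p^T$, where $p$ is the right eigenvector of $A$ associated with eigenvalue one. Let $p$ be normalized so that $\mathbb{1}^T p = 1$. This yields the following scalar recursion:

$$\mathbb{E}\{p^T\mathcal{W}_i \mid \mathcal{F}_{i-1}\} \le \beta_i\,p^T\mathcal{W}_{i-1} + \mu^2(i)\sigma_v^2 \le (1 - 2\lambda_{\min}\epsilon\mu(i))\,p^T\mathcal{W}_{i-1} + \mu^2(i)\sigma_v^2 \qquad (78)$$

where $0 < \epsilon < 1$ for sufficiently large $i$.

2) Global Convergence: It is still not clear what the effect of the aggregation step (10) is on the convergence of the algorithm. We require the following lemma from [21, p. 45].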
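Lemma 1: Let

$$v(i) \le (1 - a(i-1))\,v(i-1) + \beta(i-1)$$

$$\sum_{i=0}^{\infty} a(i) = \infty, \qquad 0 < a(i) \le 1, \qquad \beta(i) \ge 0, \qquad \frac{\beta(i)}{a(i)} \to 0.$$

Then, $\limsup_{i\to\infty} v(i) \le 0$. In particular, if $v(i) \ge 0$, then $v(i) \to 0$. ■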
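The behavior described by Lemma 1 can be illustrated by iterating the recursion with equality, which is the worst case permitted by the inequality. The sketch below uses the illustrative (assumed) sequences $a(i) = \min\{1, 1.5/(i+1)\}$ and $\beta(i) = 1/(i+1)^2$, which mimic a step-size of the form $\mu/i$; these satisfy $\sum_i a(i) = \infty$ and $\beta(i)/a(i) \to 0$.

```python
def iterate_bound(v0, a, beta, n_steps):
    """Iterate v(i) = (1 - a(i-1)) * v(i-1) + beta(i-1) for i = 1..n_steps."""
    v = v0
    for i in range(1, n_steps + 1):
        v = (1.0 - a(i - 1)) * v + beta(i - 1)
    return v

def a_seq(i):
    # Illustrative choice: non-summable sequence in (0, 1].
    return min(1.0, 1.5 / (i + 1))

def beta_seq(i):
    # Illustrative choice: beta(i) / a(i) -> 0 as i grows.
    return 1.0 / (i + 1) ** 2
```

Calling `iterate_bound(10.0, a_seq, beta_seq, n)` for increasing `n` yields values that shrink roughly like $2/n$, in line with the lemma.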

For the first part of Theorem 1 (asymptotic mean-square convergence), we take the expectation of both sides of the inequality in (78) over the past history $\mathcal{F}_{i-1}$:

$$\mathbb{E}\{p^T\mathcal{W}_i\} \le (1 - 2\lambda_{\min}\epsilon\mu(i))\,\mathbb{E}\{p^T\mathcal{W}_{i-1}\} + \mu^2(i)\sigma_v^2 \qquad (79)$$

Noting that

$$\sum_{i=1}^{\infty} 2\lambda_{\min}\epsilon\mu(i) = \infty, \qquad \lim_{i\to\infty}\frac{\mu^2(i)\sigma_v^2}{2\lambda_{\min}\epsilon\mu(i)} = 0$$

we then invoke Lemma 1 to arrive at the desired result since $0 < 2\lambda_{\min}\epsilon\mu(i) < 1$ for large enough $i$. For the almost sure convergence statement, we call upon a stochastic counterpart [21, pp. 49-50] to Lemma 1.

Lemma 2: Let there be a sequence of random variables $v(0), \ldots, v(i) \ge 0$, $\mathbb{E}\,v(0) < \infty$ and

$$\mathbb{E}\{v(i) \mid v(0), \ldots, v(i-1)\} \le (1 - a(i-1))\,v(i-1) + \beta(i-1)$$

$$\sum_{i=0}^{\infty} a(i) = \infty, \qquad 0 < a(i) \le 1, \qquad \beta(i) \ge 0, \qquad \frac{\beta(i)}{a(i)} \to 0, \qquad \sum_{i=0}^{\infty}\beta(i) < \infty$$

Then, $v(i) \to 0$ almost surely. ■

We see that (78) fits the form of Lemma 2, so we conclude that $p^T\mathcal{W}_i \to 0$ almost surely, so $\mathcal{W}_i \to 0$ almost surely as well since all the entries of $p$ are strictly positive when Assumption 3 is satisfied. This also implies that $w_{k,i} \to w^o$ almost surely for all $k = 1, \ldots, N$.

B. Proof of Theorem [2] 

We now observe that the matrix Bj can be written as: 

Bj = ({lN®lM-^i{t){lN(E)W^J{w'')){A(E)lM)) = (A (g) Im) - (A ® ^(^)VV(^(;°)) 

= ^ (g) (/m - 

Then, the weighting matrix of the transient term (first term in (35)) can be written as:

(80)

Let $K_i$ denote the following diagonal matrix, which appears in (80):

(81)

Now, using (39), we conclude that



(82) 



where the first term is a constant and does not vary with i. On the other hand, all other terms vary with i. We now 
introduce the following lemma that relates the norm of a matrix power to the power of its spectral radius. 

Lemma 3 (Bound on the norm of a matrix power): Let $A$ denote a matrix whose spectral radius $\rho(A)$ is strictly less than 1. Then,

$$\|A^n\| \le c\cdot\left(\frac{\rho(A) + 1}{2}\right)^n$$

where $n \in \mathbb{N}$ and $c$ is some positive constant.

Proof: From p7| p. 299], we have:

$$\|A^n\| \le c\cdot(\rho(A) + \epsilon)^n$$

for any $\epsilon > 0$. We now let $\epsilon = (1 - \rho(A))/2$ and we get the desired result.

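We can now see that all the time-varying terms in (82) decay to zero at least at an exponential rate:

$$\|Y D_{N-1}^{i-1} R^T E_{kk}\mathbb{1}p^T\|_2 \le \|Y D_{N-1}^{i-1} R^T\|_2\cdot\|E_{kk}\mathbb{1}p^T\|_2$$

$$\|p\mathbb{1}^T E_{kk} R\,(D_{N-1}^{i-1})^T Y^T\|_2 = \|Y D_{N-1}^{i-1} R^T E_{kk}\mathbb{1}p^T\|_2$$

$$\|Y D_{N-1}^{i-1} R^T E_{kk} R\,(D_{N-1}^{i-1})^T Y^T\|_2 \le \|E_{kk}\|_2\cdot\|Y D_{N-1}^{i-1} R^T\|_2^2$$

But using Lemma 3 we can obtain a bound on $\|Y D_{N-1}^{i-1} R^T\|_2$, a common factor of the above inequalities:

$$\|Y D_{N-1}^{i-1} R^T\|_2 = \left\|T\begin{bmatrix} 0 & \\ & D_{N-1}^{i-1}\end{bmatrix}T^{-1}\right\|_2 \le c\cdot\|T\|_2\cdot\|T^{-1}\|_2\cdot\left(\frac{\rho(D_{N-1}) + 1}{2}\right)^{i-1}$$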



Since it is assumed that the matrix $A$ is primitive, then by the Perron-Frobenius theorem, the spectral radius of $D_{N-1}$ is strictly less than 1. We can see therefore that all terms, with the exception of $p\mathbb{1}^T E_{kk}\mathbb{1}p^T$, will decay to 0 at an exponential rate. For this reason, we will ignore these terms as the convergence rate will be dominated instead by a slower term that decays at the rate $i^{-2\lambda_{\min}\mu}$. We can then write down the transient term as:



Ie {w^ {pp^ (g) $A'i$'r) u>o} 



(83) 



where the last step is due to the fact that $\mathbb{1}^T E_{kk}\mathbb{1} = 1$ since $E_{kk}$ contains a single 1 at the $(k, k)$ entry. We now
introduce the linear transformation of the initial error vector Wq, denoted as w'^. 

This transformation allows us to simplify (83) as

ENo|lBT...eT_^se._i...6i ~ i^oiP ® *) (1 ® K,) {p ® <^)'^wo} 



(84) 



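The only remaining dependence on the iteration is now embedded in $K_i$ as defined in (81). Examining this diagonal matrix we have that it is in the form:

$$K_i = \mathrm{diag}\left\{\lambda_1\prod_{j=1}^{i-1}(1 - \mu(j)\lambda_1)^2,\ \lambda_2\prod_{j=1}^{i-1}(1 - \mu(j)\lambda_2)^2,\ \ldots,\ \lambda_M\prod_{j=1}^{i-1}(1 - \mu(j)\lambda_M)^2\right\}$$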

In order to determine the rate of convergence of the matrix Ki, we appeal to the following lemma. 

Lemma 4 (Bounds and identities on finite products): Let /i > 0, A > 0, and i be large. Then, it holds that: 



(^-AA^P--(i-rA^) 



< 



2En^^'iog(i-^) 



2(rApl+2) 



(85) 



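Furthermore, let $1 \le j < i$, then

$$\prod_{t=j+1}^{i-1}\left(1 - \frac{\lambda\mu}{t}\right)^2 = \frac{\Gamma(i - \lambda\mu)^2}{\Gamma(i)^2}\cdot\frac{\Gamma(j+1)^2}{\Gamma(j + 1 - \lambda\mu)^2} \qquad (86)$$

where $\Gamma(x)$ is the gamma function [30].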



Proof: See Appendix B-A.

We can now use Lemma 4, and specifically (85), to find that

Using these results along with (84), we have:



A. n (l - ^) ^ (l - ^) • ([A.Ml - AfeM + ^r''" 



IE||'Ji'o|leT...eT_^sB._i-.-6i 

^ 2 ^ 7 / ^2(rA„..H2) EK)^, (88) 



^\\'^n\\Bj---Bj_^^B,-i--Bi 

^ 2 ^ \ ' , — vm^^ ^(^o)™ 

(.-l-A„,^)2A,„,.(i__A^) 

Notice that the expressions $(1 - \lambda_k\mu/i)^{2i}$ and $(1 - \lambda_k\mu/(i-1))^{2(i-1)}$ asymptotically converge to $e^{-2\lambda_k\mu}$, which is independent of $i$. In fact, the expressions in (88)-(89) account for the increase in the excess-risk at the beginning of the iterations. Eventually, however, the denominator terms of the form $(i - \lambda_k\mu)^{2\lambda_k\mu}$ and $(i + 1 - \lambda_k\mu)^{2\lambda_k\mu}$ will overtake the increase in $(1 - \lambda_k\mu/i)^{2i}$ and $(1 - \lambda_k\mu/(i-1))^{2(i-1)}$ and the excess-risk will begin to decay from that point onwards. Furthermore, examination of (88)-(89) shows that the $m$-th term decays at the rate $O(i^{-2\lambda_m\mu})$, making the slowest decaying term vanish at the rate of $O(i^{-2\lambda_{\min}\mu})$.

C. Proof of Theorem 3

In this section, we wish to learn the convergence rate of the asymptotic term (second term in (35)). As we will learn, this term actually determines the dominant convergence rate ($i^{-1}$) of the diffusion algorithm. With $\Sigma = \frac{1}{2}E_{kk} \otimes \nabla_w^2 J(w^o)$, we rewrite the asymptotic term as
/ 



i-1 \ / i-1 



^"1 n n ^i 



i-1 I i-1 



^ 'i If LA. h J. 1 

i-i 

2 51 '^^{rV^T-^EkkT-^ B^'~^ T^^^ 



Tr I n (/m - A^(<)A)A [] (/m - M(i)A)$^ ] (90) 

t=j+i t=j+i

where (90) is due to the trace property $\mathrm{Tr}(A \otimes B) = \mathrm{Tr}(A)\cdot\mathrm{Tr}(B)$. It is now advantageous to realize that the matrix $\mu^2(j)\prod_{t=j+1}^{i-1}(I_M - \mu(t)\Lambda)\,\Lambda\,\prod_{t=j+1}^{i-1}(I_M - \mu(t)\Lambda)$ is diagonal with entries $\mu^2(j)\lambda_m\prod_{t=j+1}^{i-1}(1 - \mu(t)\lambda_m)^2$ on the $m$-th diagonal entry. This means that we can simplify expression (90) to

/ / \ / \ TN 

i-1 



e^'wtm^i n n 



/ \t=j+i 



j^l m— 1 
m— 1 J — 1 



i-1 



i-1 

ri-l 



m=l t=J + l 

i— Irp— 1 77i_ _ ^^ — T T-lT^^JrrnT /^T ] 



Tr I D'-^T-'EkkT-^D'^' ' R.'PUmT ] (91) 



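Now we observe that the product $\prod_{t=j+1}^{i-1}(1 - \mu(t)\lambda_m)^2$ is exactly the one that appears in (86) and is described by ratios of Gamma functions:

$$\prod_{t=j+1}^{i-1}(1 - \mu(t)\lambda_m)^2 = \frac{\Gamma^2(i - \lambda_m\mu)}{\Gamma^2(i)}\cdot\frac{\Gamma^2(j+1)}{\Gamma^2(j + 1 - \lambda_m\mu)}$$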
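The Gamma-function form of the finite product can be verified numerically for the step-size $\mu(t) = \mu/t$. The sketch below compares a direct evaluation of $\prod_{t=j+1}^{i-1}(1 - c/t)^2$, with $c$ standing for $\lambda_m\mu$, against the closed form evaluated through `math.lgamma`; the function names are hypothetical and the check assumes $c < j + 1$ so that all Gamma arguments are positive.

```python
import math

def product_direct(j, i, c):
    """Direct evaluation of prod_{t=j+1}^{i-1} (1 - c/t)^2."""
    p = 1.0
    for t in range(j + 1, i):
        p *= 1.0 - c / t
    return p * p

def product_gamma(j, i, c):
    """Closed form Gamma(i-c)^2 Gamma(j+1)^2 / (Gamma(i)^2 Gamma(j+1-c)^2).

    Log-Gamma values are used to avoid overflow; requires c < j + 1.
    """
    log_val = (math.lgamma(i - c) + math.lgamma(j + 1)
               - math.lgamma(i) - math.lgamma(j + 1 - c))
    return math.exp(2.0 * log_val)
```

The two evaluations agree to floating-point accuracy, for instance for `j = 3`, `i = 50` and `c = 1.7`.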

We now call upon the following lemma that describes some properties of the Gamma function including some 
asymptotic expansions: 

Lemma 5 (Properties of Gamma functions and asymptotic expansions):

$$\lim_{|s|\to\infty}\frac{\Gamma(s + a)}{\Gamma(s)\,s^{a}} = 1 \qquad (92)$$

$$\Gamma(s, x) = x^{s-1}e^{-x}\left[\sum_{m=0}^{M-1}\frac{(-1)^m\,\Gamma(1 - s + m)}{x^m\cdot\Gamma(1 - s)} + O\left(|x|^{-M}\right)\right] \qquad (93)$$

for $|x| \to \infty$, $-3\pi/2 < \arg(x) < 3\pi/2$, $M = 1, 2, \ldots$ and where $\Gamma(s, x)$ denotes the upper incomplete Gamma function.

Proof: See [30]. ■
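Notice that the first fraction will asymptotically converge as $i^{-2\lambda_m\mu}$ according to (92). Therefore, we have

$$\mu^2(j)\,\lambda_m\prod_{t=j+1}^{i-1}(1 - \mu(t)\lambda_m)^2 \sim \mu^2\lambda_m\,i^{-2\lambda_m\mu}\cdot\frac{\Gamma^2(j+1)}{j^2\cdot\Gamma^2(j + 1 - \lambda_m\mu)} \qquad (94)$$

Substituting (94) into (91), we have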



i-1 



Em^(^)tm^i n n 



M 



i-1 



p.^ ^ A. 

m=l J 



~ 9 72A„u 



2 ^ ,2A„M^r2(j + l-A„,A*) 



Tr (D'-^T-^EkkT-^D^' ' {<i>'^ R,<S>)ramT) (95)

We use $\mathrm{Tr}(A^T B C D^T) = \mathrm{vec}(A)^T (D \otimes B)\,\mathrm{vec}(C)$ to expand the trace as



E^'WTr r n n 



2 



J2 y^c{T^i'l>'^Rv<i>)mmTf\r. 



m— 1 



E 



-r2(j + i-A,„/i) 



{D®Dy 



(96) 



We can now use the decomposition from (40) to expand the sum as



_ ■-2A„^ 



E 



-r2(j + i-A„/i) 



JV-l 



(8) 



E 



r2(j + i- A™/i) 



i-1 

^r2(j + i-A„,Ai) 



X]7 = l 



r^(j + i-A„fi) 



r'^(i+i-A,TiM) 



r (j) n'~3 san'^^ 



r^(j) 



-1 T^ii) 



r^(j + i-A„n) 



(97) 



We observe from Lemma 6 that the first product $i^{-2\lambda_m\mu}\sum_{j=1}^{i-1}\frac{\Gamma^2(j)}{\Gamma^2(j + 1 - \lambda_m\mu)}$ will converge asymptotically to $i^{-1}/(2\lambda_m\mu - 1)$ when $2\lambda_m\mu > 1$ and $\log(i)/i$ when $2\lambda_m\mu = 1$. When $2\lambda_m\mu < 1$, we notice that the diffusion algorithm can achieve an arbitrarily slow convergence rate. This means that the fastest rate at which the asymptotic term can converge is $i^{-1}$, achieved when $2\lambda_{\min}\mu > 1$, so long as the non-constant entries also decay. To show that this is the case, we show that the second to fourth matrices along the diagonal will converge to zero asymptotically:



Ei—l 



D 



Ei— 1 
.7=1 



r^o+i-A™/.) 



< 



Ei— 1 
.7 = 1 



r2(j+i-A„p) ii-^AT-i 



D 



.7 = 1 



j=i r^j+i-x,„n) 



< c- 



Ei— 1 
.7 = 1 



r"(3) f p{Dn-i) + i \ 

j=i r^(j+i-A„.p) V 2 J 
-1 r^O) 



j=i r20+i-A„M) 

In order to evaluate the rate of decay of the above terms, we appeal to the following two lemmas: 



(98)
Lemma 6: The sequence $i^{-2\lambda_m\mu}\sum_{j=1}^{i-1}\frac{\Gamma^2(j)}{\Gamma^2(j + 1 - \lambda_m\mu)}$



^^r2(j + i-A„,/i) 



log(') I 



satisfies: 

5- + e(i-2A™M) 



F2(l,l:l;2-A„jj,2-A„/f;l) 



2Am^ > 1 

2AmM — 1 



-2A„p 



r(2-A„.A')" * + ^-^"^ < 



as $i \to \infty$, where ${}_3F_2(a_1, a_2, a_3; b_1, b_2; z)$ is the generalized hypergeometric function [30, p. 1010].

Proof: See Appendix B-B.
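The leading-order behavior stated for the case $2\lambda_m\mu > 1$ can be checked numerically. The sketch below evaluates the scaled sum $i^{-2x}\sum_{j=1}^{i-1}\Gamma^2(j)/\Gamma^2(j+1-x)$, with $x$ standing for $\lambda_m\mu$, and compares it with $i^{-1}/(2x - 1)$. The function names are hypothetical, and the routine assumes $x < 2$ so that every Gamma argument is positive.

```python
import math

def scaled_gamma_sum(i, x):
    """i^(-2x) * sum_{j=1}^{i-1} Gamma(j)^2 / Gamma(j+1-x)^2, for x < 2."""
    total = sum(math.exp(2.0 * (math.lgamma(j) - math.lgamma(j + 1.0 - x)))
                for j in range(1, i))
    return total * i ** (-2.0 * x)

def leading_term(i, x):
    """Leading term 1 / ((2x - 1) * i), valid when 2x > 1."""
    return 1.0 / ((2.0 * x - 1.0) * i)
```

For `x = 1` the sum equals $(i-1)/i^2$ exactly, and for `x = 1.5` the ratio of the two functions approaches one as `i` grows.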
Lemma 7: When < p < 1, the sequence i^^A^p — ^ 



r2(j + i-A™Ai) 



(j+i-A™m) 

-2 



Specifically, 



log(p-i) 

Proof:: See Appendix |B-C| 



< lim 35- • 



-2A„^t 



= e(z-^), 



satisfies: 



as z — ?> 00 



< 



r2(j + i-A™p) - log(p-i) 



From Lemma |6j we have that the denominator of ( |98| l will converge: r^(j+i-A ~^ i2A,„^-i ^ constant 

when 2Am/i 7^ 1 and X]j=i r^(j+i-A /j) ^ l'^g(*) + constant otherwise. On the other hand, the numerator will 
converge at the rate according to Lemma [t] regardless of 2AmM long as p{Dn^i) < 1, which is clearly 

satisfied here when the combination matrix A is primitive. This implies that the second and third matrices along 
the diagonal of ( |97| ) will converge to the zero matrix asymptotically at a rate of i^2A„,^-i ,^jjgjj 2A,„/i ^ 1 and 
i"'^ / \og{i) otherwise. In a similar manner, we observe that the final matrix will also converge to the zero matrix: 



Ei— 1 
.1 = 1 



r2(j+i-A„M) ^'-i 



Ei—l 
.7 = 1 



r2(j+i-A,„M) 



Ei — 1 
.7 = 1 



< 



< C- 



^ \\ni-o 112 



Ei—l 
.7 = 1 



r20+i-A,„M) 



^.7=1 r2(.,-t-i-A,^.M) V 



p(Dn-i) + 1 

r^b-t-i-A,„M) V 2 



Ei— 1 
.7 = 1 



The same results apply in this case, and this matrix will also converge to the zero matrix asymptotically at a rate 
of j-2A,„7i-i yyjjgjj 2XmiJL 7^ 1 and i^'^ /\og{i) otherwise. Since all the non-constant matrices in will converge 
to zero at a relatively high rate, we can now have the following asymptotic relationship for ( |97] l 



i-l 

-2A„/i 



^r2(j + i-A„,p) 



{D®D) 



where 



2A„,7J-1 ' 



am(i)-E^ii ® £^11 

2A,„/i > 1 
2A„/i = 1 



(99) 



(100) 



3F2(1,1.1;2-A„M,2-A„.7t;l) •-2A„7i ox „ ^ -, 
r(2-A,„7i)2 ' ' ' ^Am/X <. 1

Now, substituting (99) back into (96), we have

i—l / I i—1 \ / i—1 



E/^'WT^ K n n 

j=i V \t=j+i J \t=j+i 



2 



M 



m—1 
M 



= y II A,„a„(z)Tr {{'^''^ R,^)^,nTEiiT-^EkkT-^ Ei^T^) (101) 

m— 1 

where the second equality is due to $\mathrm{Tr}(A^T B C D^T) = \mathrm{vec}(A)^T (D \otimes B)\,\mathrm{vec}(C)$. We now observe that $T E_{11} T^{-1}$ is a rank-1 matrix that is spanned by the left- and right-eigenvectors of $A$ corresponding to the eigenvalue 1. The left eigenvector is $\mathbb{1}_N$ since $A$ is left-stochastic. Denote the right eigenvector by $p$ and normalize the sum of its entries to unity; i.e., $p^T\mathbb{1}_N = 1$, and $Ap = p$. Then, we have that $T E_{11} T^{-1} = p\mathbb{1}_N^T$. Substituting into (101) we get

i-l I I j-1 \ / i-1 \ \ 2 



, , , , , , 2 

j=i \ \t=j+i / \t=j+i / / "1=1 



1=1 

M 

Plla I] A™a™(z)($Ti?^$)^^ (102) 



o M 



2 

ra—l 

where the second equality is due to the fact that $E_{kk}$ is an $N \times N$ matrix with a single 1 on the $(k, k)$-th position and $a_m(i)$ is defined in (100). We can immediately observe that the slowest rate at which the asymptotic term converges depends on the smallest eigenvalue of $\nabla_w^2 J(w^o)$ and the initial $\mu$. In the case where $2\lambda_{\min}\mu = 1$, we see that the asymptotic term will converge at the rate $O(\log(i)/i)$. On the other hand, when $2\lambda_{\min}\mu \gg 1$, we have that

g M20)Tr [qA n bA S I n I U • Iblli, 2A„,„;. » 1 

as $i \to \infty$. Finally, when the Hessian matrix at the optimizer is $\nabla_w^2 J(w^o) = \lambda I_M$ and $2\lambda\mu = 1$, we have the expression:

EA.)Tr4-f n ^^14 n ^*)U^-^-MI^, 2A, = 1 

as $i \to \infty$.

1) High Probability Convergence Rate: What we have shown so far is that the excess-risk decays as $\|p\|_2^2/i$ when $2\lambda_{\min}\mu > 1$. Indeed, this result can be further strengthened to show that the excess-risk will converge at that rate with high probability. In order to accomplish this, we notice that we have shown:

E{J(-u;fc,,_i) - Jiw")} = e (Ml\ ^ as I ^ oo 



when 2AminM > 1- Introducing > 0, we have 

e(Mi)

Now, utilizing Markov's inequality [26, p. 151], we have that



Pr I J(i«;fe,,_i) - J(u;°) < 6 (^Ml^ | > i _ -^^^ = 1 - z^, as z ^ oo 



In fact, when 2Ai„i„/x > 1 and p — ^1 (as we will observe in Sec. 

M 

2N 

and with probability at least 1 — ly, 



N ■ 



IV I, we have 



,2 



M 



J{wk,i-i) - J{w°) < I ^ A„a„(z)($"^i?„$)„„ I , as i 00 

\m=l I 

We do not cover the case where $2\lambda_{\min}\mu < 1$ because, in this case, the transient term decays at the same rate as the asymptotic term, so we would need to examine the bounds on the transient term as well. This case is not of interest since we usually wish to attain the fastest asymptotic rate $O(\|p\|_2^2/i)$.
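
The Markov-inequality step above can be illustrated numerically. The following is a minimal sketch and not part of the original analysis: it assumes a hypothetical non-negative random variable (exponentially distributed, standing in for the excess-risk at a fixed iteration) and checks that it falls below its mean divided by $\nu$ with empirical frequency at least $1 - \nu$. The names `markov_coverage`, `mean`, and `nu` are illustrative only.

```python
import random


def markov_coverage(mean, nu, n_samples=100_000, seed=0):
    """Empirical frequency of the event X < mean / nu for X ~ Exp(mean).

    Markov's inequality guarantees Pr{X >= mean / nu} <= nu for any
    non-negative X with E[X] = mean, so the returned value should be
    at least 1 - nu (up to sampling error).
    """
    rng = random.Random(seed)
    threshold = mean / nu
    # random.expovariate takes the rate (1 / mean) as its argument.
    hits = sum(rng.expovariate(1.0 / mean) < threshold for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    for nu in (0.5, 0.2, 0.05):
        print(nu, markov_coverage(mean=1e-3, nu=nu))
```

For a light-tailed variable such as this one, the bound is typically loose: the observed frequency is well above $1 - \nu$, which is consistent with Markov's inequality being a worst-case statement.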

D. Proof of Theorem 4

We write down the eigenvalue decomposition of $B_i$ as $B_i = \sum_{k=1}^{N} \sum_{m=1}^{M} y_{k,m}\, y_{k,m}^T\, \lambda_{k,m}(B_i)$, where $\lambda_{k,m}(B_i)$ is listed in Table II for the diffusion and consensus strategies. Furthermore, since the eigenvectors are orthonormal (due to Assumption [?]), we have that the finite product of $B_t$ matrices is:

$$\prod_{t=j+1}^{i-1} B_t = \sum_{k=1}^{N} \sum_{m=1}^{M} y_{k,m}\, y_{k,m}^T \prod_{t=j+1}^{i-1} \lambda_{k,m}(B_t) \tag{103}$$

We may now substitute (103) into (61) and use the fact that the eigenvectors are orthonormal to find that the
expected excess-risk asymptotically satisfies:

2 N M I i-l i-l 

= 1^ E E • ll^fcll' ■ E 72 n • y^nM, 

fc=lm=l \j = l t=j + l 

By substituting the values from Table II, we can find the asymptotic expected network excess-risk for the diffusion
and consensus strategies as:

ER'^'"«-|^EE^™-ii-^™iiL-ik.iiMi2/.ii^E^^ n (1-^) 

fc=lm=l i = l t=j + l ^ ^ 

ER-w = |^EE^™-ii^™iiL-ii-^-iiMi^^ii2-E^^^ n (1-^^^) 

k=lm=l j = l •' t=j + l ^ ^ 

We now utilize Lemma 4 and the asymptotic expansion in (93) to find that the finite product above can be rewritten as

,,2 N M i-l ^2(1- j) p2/•^ 

ER''^^) = 9^E E huwi-Y. 



fe=lm=l j = l '"'^'^ 

2 N M i-l jj2(i-j-l) ^2/■^ 

ER^-w^^^EE^'-ii-'-iiL-KiiMbfcii^E 



O l\r '™ ll"m||K„ II'K|I2 . , . n .tx „n-i 

2^ ^1 ^1 r2(j + 1 - XrnflDj) ■ 
We observe that the equations for the asymptotic expected excess-risk for the diffusion and consensus strategies are
identical except for the innermost summation. When $D_{kk} = 1$, the summands inside the first sum are identical. In
fact, this slight difference is the key to the performance difference between the two algorithms. In order to show
that the expected excess-risk of diffusion networks is always below that of the consensus strategy, we compare the
two sums term by term and show that each term of the expected excess-risk of the consensus strategy is lower-bounded
by an upper bound on the corresponding term for diffusion networks. To see this, we utilize Lemma 7 to bound the above
finite sum (for $D_{kk} < 1$, which occurs for all eigenvalues except $D_{11}$ when $A$ is primitive):



n2 , „-2A,„u-2 --2 ''-1 ■2A„,u_D. .^-2 n^2 

^kk .-2 < V n2('-J) -^ < ' < V n2('-J-i) -^ < ^kk 



■2 



as $i \to \infty$. Therefore, each term inside the sum is lower-bounded (for consensus) by the upper bound for diffusion.
Hence, we have $\mathrm{ER}^{\mathrm{cons}}(i) - \mathrm{ER}^{\mathrm{diff}}(i) \ge 0$ as $i \to \infty$.

Appendix B 
Proof of Lemmas 

In this appendix, we provide proofs for the lemmas in the manuscript. 

A. Proof of Lemma 5
For (85), we have

Furthermore, using the integral bounds (for any increasing function $f(x)$):



we have 



Evaluating the integrals, we obtain the bounds 

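$$\int_{\lceil\lambda\mu\rceil+1}^{i} \log\left(1 - \frac{\lambda\mu}{x}\right) dx = i \log\left(1 - \frac{\lambda\mu}{i}\right) - \lambda\mu \log(i - \lambda\mu) - (\lceil\lambda\mu\rceil + 1) \log\left(1 - \frac{\lambda\mu}{\lceil\lambda\mu\rceil + 1}\right) + \lambda\mu \log(\lceil\lambda\mu\rceil - \lambda\mu + 1) \tag{105}$$

and

$$\int_{\lceil\lambda\mu\rceil+2}^{i+1} \log\left(1 - \frac{\lambda\mu}{x}\right) dx = \lambda\mu \log(\lceil\lambda\mu\rceil - \lambda\mu + 2) + (i + 1) \log\left(1 - \frac{\lambda\mu}{i+1}\right) - \lambda\mu \log(i + 1 - \lambda\mu) - (\lceil\lambda\mu\rceil + 2) \log\left(1 - \frac{\lambda\mu}{\lceil\lambda\mu\rceil + 2}\right) \tag{106}$$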
Substituting (105)-(106) into (104), we have:



For (86), we observe using $\Gamma(x + 1) = x\,\Gamma(x)$ that



and, therefore, 

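$$\prod_{t=j+1}^{i} (t - \lambda\mu)^2 = \frac{\Gamma(j + 1 + 1 - \lambda\mu)^2}{\Gamma(j + 1 - \lambda\mu)^2} \cdot \frac{\Gamma(j + 2 + 1 - \lambda\mu)^2}{\Gamma(j + 2 - \lambda\mu)^2} \cdots \frac{\Gamma(i + 1 - \lambda\mu)^2}{\Gamma(i - \lambda\mu)^2} = \frac{\Gamma(i + 1 - \lambda\mu)^2}{\Gamma(j + 1 - \lambda\mu)^2}$$

Then,

$$\prod_{t=j+1}^{i} \left(1 - \frac{\lambda\mu}{t}\right)^2 = \frac{\prod_{t=j+1}^{i} (t - \lambda\mu)^2}{\prod_{t=j+1}^{i} t^2} = \frac{\Gamma(i + 1 - \lambda\mu)^2\, \Gamma(j + 1)^2}{\Gamma(j + 1 - \lambda\mu)^2\, \Gamma(i + 1)^2}$$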
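
The closed form for the finite product can be checked numerically. The sketch below is illustrative only: it assumes the product runs over $t = j+1, \ldots, i$ and that $j + 1 > \lambda\mu$, so that every Gamma argument is positive. The function names and the variable `lam_mu` (standing for the product $\lambda\mu$) are hypothetical.

```python
import math


def product_direct(lam_mu, j, i):
    """Direct evaluation of prod_{t=j+1}^{i} (1 - lam_mu / t)**2."""
    result = 1.0
    for t in range(j + 1, i + 1):
        result *= (1.0 - lam_mu / t) ** 2
    return result


def product_gamma(lam_mu, j, i):
    """Same product via the Gamma-function closed form.

    Uses log-Gamma to avoid overflow for large i. Assumes j + 1 > lam_mu
    so that all Gamma arguments are positive.
    """
    log_ratio = (math.lgamma(i + 1 - lam_mu) + math.lgamma(j + 1)
                 - math.lgamma(j + 1 - lam_mu) - math.lgamma(i + 1))
    return math.exp(2.0 * log_ratio)


if __name__ == "__main__":
    for lam_mu, j, i in [(0.7, 3, 50), (2.5, 3, 400), (0.25, 1, 1000)]:
        print(lam_mu, j, i, product_direct(lam_mu, j, i), product_gamma(lam_mu, j, i))
```

Working with `math.lgamma` rather than `math.gamma` keeps the evaluation stable when $i$ is large, since $\Gamma(i+1)$ overflows double precision for $i$ beyond roughly 170.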

B. Proof of Lemma 6

We must consider the series $i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j+1-\lambda_m\mu)^2}$ under different conditions on $2\lambda_m\mu$, namely: $2\lambda_m\mu > 1$, $2\lambda_m\mu = 1$, and $2\lambda_m\mu < 1$. We will use the following integral bounds for a positive monotonic function $f(x)$:



$$\int_{a}^{b} f(x)\,dx \;\le\; \sum_{j=a}^{b} f(j) \;\le\; \int_{a-1}^{b+1} f(x)\,dx$$
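
Integral bounds of this kind can be sanity-checked numerically. The sketch below assumes the bounds take the form $\int_a^b f \le \sum_{j=a}^{b} f(j) \le \int_{a-1}^{b+1} f$, which holds for any positive monotonic $f$, and uses the test function $f(x) = x^s$, whose antiderivative is known in closed form. The helper names and the choice of test function are assumptions made for illustration.

```python
def power_integral(s, lo, hi):
    """Closed-form integral of x**s over [lo, hi], for s != -1 and 0 < lo <= hi."""
    return (hi ** (s + 1) - lo ** (s + 1)) / (s + 1)


def check_integral_bounds(s, a, b):
    """Return (lower, total, upper) for f(x) = x**s summed over j = a..b.

    lower is the integral of f over [a, b];
    upper is the integral of f over [a - 1, b + 1].
    Requires a >= 2 so that f is finite on the whole integration range.
    """
    total = sum(j ** s for j in range(a, b + 1))
    lower = power_integral(s, a, b)
    upper = power_integral(s, a - 1, b + 1)
    return lower, total, upper


if __name__ == "__main__":
    # s = -0.5 gives a decreasing f, s = 1.5 an increasing one.
    for s in (-0.5, 1.5):
        lower, total, upper = check_integral_bounds(s, a=2, b=1000)
        print(s, lower <= total <= upper)
```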



Now, observe that due to (92), we have that for any $\epsilon > 0$, there exists an integer $q > 1$ such that for all $j \ge q$, the
following is satisfied:

$$1 - \epsilon < j^{2-2\lambda_m\mu}\, \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} < 1 + \epsilon \tag{107}$$

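Therefore, we may divide the sum into two parts, $j < q$ and $j \ge q$:

$$i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} = i^{-2\lambda_m\mu} \left( \sum_{j=1}^{q-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} + \sum_{j=q}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} \right) = \Theta(i^{-2\lambda_m\mu}) + i^{-2\lambda_m\mu} \sum_{j=q}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2}$$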



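Using (107), and for the case where $2\lambda_m\mu \ne 1$, we can obtain the upper-bound:

$$i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} \le \Theta(i^{-2\lambda_m\mu}) + (1 + \epsilon)\, i^{-2\lambda_m\mu} \sum_{j=q}^{i-1} j^{2\lambda_m\mu - 2}$$

$$\le \Theta(i^{-2\lambda_m\mu}) + (1 + \epsilon)\, i^{-2\lambda_m\mu} \int_{q-1}^{i} x^{2\lambda_m\mu - 2}\,dx$$

$$= (1 + \epsilon)\, \frac{i^{-1}}{2\lambda_m\mu - 1} + \Theta(i^{-2\lambda_m\mu})$$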

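and using (107) again, we can obtain the lower-bound for the case where $2\lambda_m\mu \ne 1$ (by taking the limits of
integration to be $\{q, i-1\}$):

$$i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} \ge (1 - \epsilon)\, \frac{i^{-1}}{2\lambda_m\mu - 1} + \Theta(i^{-2\lambda_m\mu}), \quad \text{as } i \to \infty$$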

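Notice that this implies that when $2\lambda_m\mu > 1$, $\lim_{i\to\infty} i \cdot i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j+1-\lambda_m\mu)^2} = \frac{1}{2\lambda_m\mu - 1}$. On the other hand, our lower and upper bounds indicate that when $2\lambda_m\mu < 1$, then $i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j+1-\lambda_m\mu)^2} = \Theta(i^{-2\lambda_m\mu})$ as $i \to \infty$. In fact, it is possible to derive the exact function that $i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j+1-\lambda_m\mu)^2}$ converges to when $2\lambda_m\mu < 1$. In order to accomplish this, we compute the following limit:

$$\lim_{i\to\infty} \frac{i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2}}{i^{-2\lambda_m\mu}} = \sum_{j=1}^{\infty} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2}$$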



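It is still not clear if the series $\sum_{j=1}^{\infty} \frac{\Gamma(j)^2}{\Gamma(j+1-\lambda_m\mu)^2}$ converges. To examine this, we recast the series as a scaled generalized hypergeometric function [30, p. 1010]:

$$\sum_{j=1}^{\infty} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} = \frac{{}_3F_2(1, 1, 1;\, 2 - \lambda_m\mu,\, 2 - \lambda_m\mu;\, 1)}{\Gamma(2 - \lambda_m\mu)^2}$$

It is known that, on the circle $|z| = 1$, the series ${}_3F_2(a_1, a_2, a_3; b_1, b_2; z)$ is convergent [40, p. 45], [41, pp. 85-86] if $\mathrm{Re}\{(b_1 + b_2) - (a_1 + a_2 + a_3)\} > 0$, which is satisfied in this case as $2\lambda_m\mu < 1$. This implies that $i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j+1-\lambda_m\mu)^2} \to \frac{{}_3F_2(1, 1, 1;\, 2 - \lambda_m\mu,\, 2 - \lambda_m\mu;\, 1)}{\Gamma(2 - \lambda_m\mu)^2} \cdot i^{-2\lambda_m\mu}$ asymptotically when $2\lambda_m\mu < 1$.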
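Finally, for the case where $2\lambda_m\mu = 1$, using (107), we can obtain the upper-bound:

$$i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} \le \Theta(i^{-1}) + (1 + \epsilon)\, i^{-1} \sum_{j=q}^{i-1} \frac{1}{j} \le \Theta(i^{-1}) + (1 + \epsilon)\, i^{-1} \int_{q-1}^{i} \frac{1}{x}\,dx = (1 + \epsilon)\, \frac{\log i}{i} + \Theta(i^{-1})$$

and using (107) again, we can obtain the lower-bound (by taking the limits of integration to be $\{q, i-1\}$), for the
case where $2\lambda_m\mu = 1$:

$$i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j + 1 - \lambda_m\mu)^2} \ge (1 - \epsilon)\, \frac{\log i}{i} + \Theta(i^{-1}), \quad \text{as } i \to \infty$$

Notice that this implies that when $2\lambda_m\mu = 1$, $\lim_{i\to\infty} \frac{i}{\log i}\, i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \frac{\Gamma(j)^2}{\Gamma(j+1-\lambda_m\mu)^2} = 1$.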
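
The three regimes treated in this proof can be explored numerically. The sketch below is a rough check under stated assumptions, not a substitute for the proof: it assumes the series of interest is $i^{-2\lambda_m\mu} \sum_{j=1}^{i-1} \Gamma(j)^2 / \Gamma(j+1-\lambda_m\mu)^2$ and requires $\lambda_m\mu < 2$ so that all Gamma arguments are positive. The function name `scaled_sum` and the variable `lam_mu` are hypothetical.

```python
import math


def scaled_sum(lam_mu, i):
    """Return i**(-2*lam_mu) * sum_{j=1}^{i-1} Gamma(j)^2 / Gamma(j+1-lam_mu)^2.

    Each term is computed through log-Gamma to avoid overflow.
    Assumes lam_mu < 2 so that j + 1 - lam_mu > 0 for every j >= 1.
    """
    total = 0.0
    for j in range(1, i):
        total += math.exp(2.0 * (math.lgamma(j) - math.lgamma(j + 1 - lam_mu)))
    return total * i ** (-2.0 * lam_mu)


if __name__ == "__main__":
    i = 200_000
    # 2*lam_mu > 1: i * scaled_sum should approach 1 / (2*lam_mu - 1).
    print(i * scaled_sum(0.75, i), 1.0 / (2 * 0.75 - 1))
    # 2*lam_mu = 1: (i / log i) * scaled_sum should approach 1, but only slowly.
    print(i / math.log(i) * scaled_sum(0.5, i))
    # 2*lam_mu < 1: i**(2*lam_mu) * scaled_sum is a partial sum of a convergent series.
    print(i ** 0.5 * scaled_sum(0.25, i))
```

In the boundary case $2\lambda_m\mu = 1$, the lower-order terms are only a factor of $\log i$ smaller than the leading term, so the printed ratio approaches 1 slowly as $i$ grows.
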
C. Proof of Lemma 7

In this lemma, we are interested in the rate at which the series i^2A„,^ rx^T^^^w^F '^^^^^^?>^^ when 

$0 < \rho < 1$. Notice that due to (92), we have that for any $\epsilon > 0$, there exists an integer $q > 1$ such that for all $j \ge q$,
(107) is satisfied. Therefore, we may divide the sum into two parts, $j < q$ and $j \ge q$:

* r ^ i(7 + 1 - A„,/i)^ 



^ r(j + 1 - A„/i)2 I ^ r(j + 1 - A„,/i)2 ^ r(j + 1 - x„,^,y 



, ■2A„A'-2 . ■2-2A„.p 1"(-^')^ 



r(j + 1 - x^^^y 
Using (107), we can obtain the following upper-bound:



-ro- + i-A,„A*) 



r(2A„Ai- i,(g-i)iog(p)) r(2A„^- i,nog(p)) 



l0g(p)2A„,M-l log(p)2^'"A'-l 



^ ^ ^2A„M . log(p)2A™M-l ''^'^^ 



where (a) is a consequence of the indefinite integral in [30, p. 108]. Step (b) is a consequence of the asymptotic expansion of the upper incomplete Gamma function listed in (93). The lower bound is derived in a similar way,
except by taking the integral limits to be $\{q, i-1\}$:



^r(j + l-A™M)' - V ; V ^log(p-i) 



where (a) once again is a consequence of the indefinite integral in [30, p. 108] and step (b) is a consequence of the asymptotic expansion of the upper incomplete Gamma function listed in (93). Therefore, we conclude that

P < lim- -2A„M-2 y-*-i p"'-r(3)" < 1 